Projections using Hypertuned model through XGboost

All data is from FanGraphs. I have no affiliation with FanGraphs, but please consider contributing to their website if you found this project informative.

1 Project Scope

1.1 Objective

This project is designed to showcase how Using a Percentile Based Worth System values Fantasy Baseball Players through a Inning Pitched (IP) weighted projection

The Categories used for prediction valuation are year-end rankings for the following metrics: - HRs - Runs - RBIs - Batting Average - Stolen Bases


2 Processing the Data

2.1 Getting Data Into R

2.1.1 Load Libraries

First we need to load the packages that R needs to run the analysis

library(sqldf) #SQL in R
library(skimr) #Summaries and useful for removing low % data
library(ggplot2) #Plotting Functions
library(plyr) #slightly deprecated data cleaning
library(dplyr) #slightly updated data cleaning
library(tidyverse) #tidyverse data cleaning universe
library(caret) #wrapper for creating, tuning and validating models
library(xgboost) #package for creating regression tree model
library(vtreat) # useful package for treating data before modeling 
library(Matrix)
library(Boruta)
library(mgcv)
library(moments) #for measuring skewness
library(data.table) #alternative to dplyr we use to create lags
library(pdp) #partial dependence graphs
library(vip) #variable importance 
library(grid) #put multiple plots on one grid
library(gridExtra) #additional grid functionality
library(janitor) #one function used to clean transposed data set
library(ggpubr) #for qq plot
library(tableHTML)
library(kableExtra)

The # comments generally explain what additional functionality each library adds to R

2.1.2 Load in Data

All data is downloaded from Fan Graphs. From this location. The data is also available on my Github here. There are player level and team data sets


#data read-in
Batter_data <- read_csv("FanGraphs Leaderboard_Hitting50PA.csv")
#Team datasets
FDG_Team = read_csv("FanGraphs Leaderboard_Team.csv")
#Create a prefix for all team stats that starts with T_
FDG_Team2 <- FDG_Team %>% 
  rename_with( ~ paste0("T_", .x))

2.1.3 Checking Team Data

str give information about an object, while skim provides a customizable summary


#Output not shown for space
#str(FDG_Team2)

skim(FDG_Team2) %>%  
  tibble::as_tibble()

2.2 Understanding the Dataset

2.2.1 Exploring the dataset

skim let’s us see how the data was imported into R. Documentation can be found here


#Full Dataset dimensions

skimr::skim(Batter_data) %>% 
  tibble::as_tibble() %>% 
  select(skim_type,skim_variable,complete_rate) %>% 
  filter(complete_rate >0.30) #288 Variables

#skim_type - character or numeric
#skim_variable - name of variable
#complete_rate - % of data that is not missing
#filter - only keep variables that have 30% of data populated

Additionally let’s look at how variables vary by year to see if there are any discrepancies there


#It looks like one year, there were fewer games played, and there is a clear drop off in home runs
Batter_data_dist =
Batter_data %>% 
 group_by(Season) %>% 
  summarize (Games_played = max(G),
             Avg_HR= mean(HR)
             )
Batter_data_dist

ggplot(Batter_data_dist, aes(Season, Avg_HR)) +
  geom_col()+
  ggtitle("Average Home Runs by Year")+
  theme(plot.title = element_text(hjust = 0.5,size = 22,color ="steel blue"))


2.3 Cleaning and Creating Initial Dataset for Model

What are some issues with the data?

  1. Many of Variables, such as K%, are being read in as characters

    • Only Team and Player Name should be characters
  2. There is spotty data coverage in some of the variables (~Variables have less than 30% Coverage)

  3. 2020 Data only includes 60 games worth of data

    • This was a season shortened due to Covid-19
  4. Team Data needs to be appended to Batter Data by Team Name


2.3.1 Cleanly Changing all Variables that are characters to numeric.

There are several ways to do this, we will identify the variables we want to change that are mis-identified. parse_number can be used to pull numbers from these variables. Additional ways to tackle this can be found here


#Select Column names that are characters but not Team or Name, These should be percentages
Batter_data_chars_to_convert <- Batter_data %>% 
  select_if(is.character)%>% select(-Team,-Name) %>% 
  mutate_all (function(x) as.numeric(readr::parse_number(x))/100)
#Note : There are additional ways to do this, this is just one solution


#We can exclude the variables we converted and reintroduce them
Batter_data_num <- Batter_data %>% select(-colnames(Batter_data_chars_to_convert))

Batter_data2 = cbind(Batter_data_num,Batter_data_chars_to_convert) %>% 
  select (colnames(Batter_data)) %>%  #preserve original order 
  dplyr::rename(flyball_perc = `FB%...46`,fastball_perc = `FB%...73`) #rename two ambiguous columns
  
skim(Batter_data2) %>% 
  as_tibble() %>% 
  group_by(skim_type) %>% 
  count()


#Logical variables are R's best guess, in our case they are all NA's and will be removed

The same can be done for the Team Data that is loaded


#Select Column names that are characters but not Team or Name, These should be percentages
FDG_Team2_chars_to_convert <- FDG_Team2 %>% 
  select_if(is.character)%>% select(-T_Team) %>% 
  mutate_all (function(x) as.numeric(readr::parse_number(x))/100)
#Keep in mind, parse number may make actual characters into numerical variables so carefully check your data before using

#We can exclude the variables we converted and reintroduce them
FDG_Team2_num <- FDG_Team2 %>% select(-colnames(FDG_Team2_chars_to_convert))

FDG_Team3 = cbind(FDG_Team2_num,FDG_Team2_chars_to_convert) %>% 
  select (colnames(FDG_Team2)) %>%  #preserve original order
dplyr::rename(T_flyball_perc = `T_FB%...45`,T_fastball_perc = `T_FB%...72`) 

skim(FDG_Team3) %>% 
  as_tibble() %>% 
  group_by(skim_type) %>% 
  count()

2.3.2 Filtering Data with Low Coverage

I choose 30% coverage of data necessary but this can be adjusted up or down. This will also get rid of columns that are all NA.


# Keep variables with enough values (Need 30% data coverage rate here)
Player_cols_to_keep =
skim(Batter_data2) %>% 
  dplyr::select(skim_type, skim_variable, complete_rate) %>% 
  filter (complete_rate > 0.30)

#Transpose Rows to get column names as skim melts the data
Player_cols_to_keep_transpose = t(Player_cols_to_keep) 

#extract the colnames we would like to keep
Player_cols_to_keep = colnames(janitor::row_to_names(Player_cols_to_keep_transpose,row_number = 2))

#Only keep the columns designated to have over 30% of their data populated or greater
Batter_data3 = Batter_data2 %>% 
  select(one_of(Player_cols_to_keep)) 

Repeat the process for Team Variables

Team_cols_to_keep =
skim(FDG_Team3) %>% 
  dplyr::select(skim_type, skim_variable, complete_rate) %>% 
  filter (complete_rate > 0.30)


#Transpose Rows to get column names as skim melts the data
Team_cols_to_keep_transpose = t(Team_cols_to_keep) 

#extract the colnames we would like to keep
Team_cols_to_keep = colnames(janitor::row_to_names(Team_cols_to_keep_transpose,row_number = 2))

#Only keep the columns designated to have over 30% of their data populated or greater
FDG_Team4 = FDG_Team3 %>% 
  select(one_of(Team_cols_to_keep)) 

2.3.3 Creating Variables Normalized by Year

Some Variables will need to be normalized by Innings_Pitched (IP) if they aren’t a percentage already. Remaining Variables are percentages or indices so will not need to be transformed. The full data dictionary for these variables can be found on FanGraph’s website here. for pitching variables and here. for hitting variables.


Batter_data4 = Batter_data3 %>% 
  mutate( #create new variables based on existing variables
    H_PA = H/PA,
    x1B_PA = `1B`/PA, #note: R can't have variables start with a number
    x2b_PA = `2B`/PA,
    x3b_PA = `3B`/PA,
    HR_PA = HR/PA,
    R_PA = R/PA,
    RBI_PA = RBI/PA,
    BB_PA = BB/PA,
    IBB_PA = IBB/PA,
    SO_PA=SO/PA,
    HBP_PA=HBP/PA,
    SF_PA=SF/PA,
    SH_PA=SH/PA,
    GDP_PA= GDP/PA,#ground into double play
    SB_PA=SB/PA,
    CS_PA=CS/PA,
    GB_PA = GB/PA,   #Groundballs
    FB_PA =  FB/PA,  #FlyBalls
    LD_PA = LD/PA,   #LineDrives
    IFFB_PA = IFFB/PA,  #Infield Fly balls
    Pitches_PA= Pitches/PA,
    Balls_PA= Balls/PA,
    Strikes_PA= Strikes/PA,
    IFH_PA= IFH/PA,
    BU_PA= BU/PA,
    BUH_PA= BUH/PA,
    PH_PA= PH/PA,
    Barrels_PA= Barrels/PA,
    HardHits_PA= HardHit/PA
  ) %>% select(-(H:CS),-(GB:BUH),-PH,-Barrels,-HardHit,-Events) #Drop the old variables

#skim(Batter_data4) %>% as_tibble()

Repeat the process for Team Variables


FDG_Team5 = FDG_Team4 %>% 
  mutate( #create new variables based on existing variables
    T_H_T_PA = T_H/T_PA,
    T_x1B_T_PA = T_1B/T_PA, #note: R can't have variables start with a number
    T_x2b_T_PA = T_2B/T_PA,
    T_x3b_T_PA = T_3B/T_PA,
    T_HR_T_PA = T_HR/T_PA,
    T_R_T_PA = T_R/T_PA,
    T_RBI_T_PA = T_RBI/T_PA,
    T_BB_T_PA = T_BB/T_PA,
    T_IBB_T_PA = T_IBB/T_PA,
    T_SO_T_PA=T_SO/T_PA,
    T_HBP_T_PA=T_HBP/T_PA,
    T_SF_T_PA=T_SF/T_PA,
    T_SH_T_PA=T_SH/T_PA,
    T_GDP_T_PA= T_GDP/T_PA,#ground into double play
    T_SB_T_PA=T_SB/T_PA,
    T_CS_T_PA=T_CS/T_PA,
    T_GB_T_PA = T_GB/T_PA,   #Groundballs
    T_FB_T_PA =  T_FB/T_PA,  #FlyBalls
    T_LD_T_PA = T_LD/T_PA,   #LineDrives
    T_IFFB_T_PA = T_IFFB/T_PA,  #Infield Fly balls
    T_Pitches_T_PA= T_Pitches/T_PA,
    T_Balls_T_PA= T_Balls/T_PA,
    T_Strikes_T_PA= T_Strikes/T_PA,
    T_IFH_T_PA= T_IFH/T_PA,
    T_BU_T_PA= T_BU/T_PA,
    T_BUH_T_PA= T_BUH/T_PA,
    T_PH_T_PA= T_PH/T_PA,
    T_Barrels_T_PA= T_Barrels/T_PA,
    T_HardHits_T_PA= T_HardHit/T_PA
  ) %>% select(-(T_H:T_CS),-(T_GB:T_BUH),-T_PH,-T_Barrels,-T_HardHit,-T_Events) #Drop the old variables


#skim(FDG_Team5) %>% as_tibble()

2.3.4 Creating Lagged Variables

There are several ways to lag a dataset BY GROUP.
* Dplyr way is here..
* The data.table (the method used below) is here.

#Note we will only be lagging the player level data, as the previous year's team performance shouldn't impact current performance


#Order the dataset by lag columns
Batter_data5 =  arrange(Batter_data4, playerid,Season) #playerid is the Fangraph id assigned to each player

# Convert dataframe to data.table format
DT_batter = data.table(Batter_data5)

#designate columns to lag - which is all of them
cols1 = colnames(Batter_data5)
anscols = paste("lag", cols1, sep="_")
DT_batter[, (anscols) := data.table::shift(.SD, 1, NA, "lag"),by ='playerid', .SDcols=cols1] #Create 1 period lags by year

Batter_data6 = as.data.frame(DT_batter) %>% select(-lag_playerid, -lag_Team, -lag_Season, -lag_Age,-lag_Name)

ncol(Batter_data5) #287 - no lags
[1] 259
ncol(Batter_data6) #574 - lagged data ~ (287 * 2)-5
[1] 513

2.3.5 Merging Team and Player Data

We can use either the merge function or the SQL functionality provided by the sqldf package to join the lagged player level data to the Team level data


df_batting_init = sqldf(
  "
  select a.*, b.*
  from Batter_data6 a
  left join FDG_Team5 b
  on a.Team = b.T_Team and a.Season = b.T_Season
  
  "
)  %>% select(-T_Team,-T_Season,T_Age,T_G,T_AB)# Unncessary Team Variables


nrow(df_batting_init) - nrow(Batter_data6) #check if any rows are duplicated
[1] 0

3 Creating Rankings for Players Based On Percentiles

We can use Percentile based ranking to get rankings for players from the 2021 season.

3.1 Worth of each stat

3.1.1 Calculating past performance

Each player goes from a 0% to 100% on each percentile stat that is used for creating a scoring opportunity. Data is not yet normalized by PA as certain stats such as HRs and SBs will be worth more when we do.


#Categories I include are:
#Runs (R), Home Runs (HR), Runs Batted In (RBI), Stolen Bases (SB), Batting Average (AVG)
df_batting_init2 =  df_batting_init %>%
#  arrange(player_id,year) %>% 
  group_by(Season) %>% 
  mutate(
    Runs_share = order(order(rank(R_PA,ties.method = 'average'),decreasing = FALSE))/n(),
     HR_share = order(order(rank(HR_PA,ties.method = 'average'),decreasing = FALSE))/n(),
     RBI_share = order(order(rank(RBI_PA,ties.method = 'average'),decreasing = FALSE))/n(),
     SB_share = order(order(rank(SB_PA,ties.method = 'average'),decreasing = FALSE))/n(),
     AVG_share = order(order(rank(AVG,ties.method = 'average'),decreasing = FALSE))/n(),
    OPS_share = 0,
    Worth = Runs_share+HR_share+RBI_share+SB_share+AVG_share+OPS_share
    ) %>% 
  ungroup() 

Chart of the Distribution of initial percentiles
As the chart below shows, the data is roughly normal.


skewness((df_batting_init2$Worth))
[1] -0.25
ggplot2::qplot(df_batting_init2$Worth, main="Total Dataset") + geom_histogram(colour="black", fill="lime green")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

min(df_batting_init2$Worth)
[1] 0.029
max(df_batting_init2$Worth)
[1] 4.8
ggpubr::ggqqplot(df_batting_init2$Worth)


shapiro.test(df_batting_init2$Worth)

    Shapiro-Wilk normality test

data:  df_batting_init2$Worth
W = 1, p-value <0.0000000000000002

3.2 2021 Player Rankings - Per PA performance

3.2.1 2021 Player Rankings - Top Worth Player

There are per PA rankings. Players like Byron Buxton which had a great per PA score but can’t stay healthy for a season will be adjusted down.


options(digits=2)

df_batting_init2021 =
df_batting_init2 %>% 
  group_by(Name) %>% 
  filter(Season == 2021) %>% 
  arrange(desc(Worth)) %>% 
  select(Name,Runs_share,HR_share,RBI_share, SB_share,OPS_share,AVG_share,Worth)


df_batting_init2021 %>%
  filter (Worth>3.9) %>% 
  kbl() %>% 
 kable_material(c("striped", "hover","condensed","responsive"),full_width = F,fixed_thead = T)
Name Runs_share HR_share RBI_share SB_share OPS_share AVG_share Worth
Fernando Tatis Jr. 0.99 1.00 0.98 0.96 0 0.89 4.8
Ronald Acuna Jr. 1.00 0.98 0.82 0.96 0 0.90 4.7
Byron Buxton 1.00 0.99 0.70 0.92 0 0.97 4.6
Tyler O'Neill 0.96 0.97 0.86 0.88 0 0.91 4.6
Jose Ramirez 0.98 0.93 0.94 0.94 0 0.78 4.6
Teoscar Hernandez 0.92 0.90 1.00 0.80 0 0.94 4.6
Kyle Tucker 0.87 0.89 0.94 0.86 0 0.94 4.5
Bryce Harper 0.96 0.95 0.80 0.82 0 0.97 4.5
Bo Bichette 0.99 0.76 0.85 0.92 0 0.95 4.5
Shohei Ohtani 0.95 0.99 0.90 0.94 0 0.69 4.5
Javier Baez 0.87 0.93 0.92 0.91 0 0.77 4.4
Frank Schwindel 0.97 0.90 0.95 0.55 0 0.99 4.4
Vladimir Guerrero Jr. 0.99 0.98 0.92 0.49 0 0.98 4.4
Trea Turner 0.96 0.78 0.63 0.97 0 1.00 4.3
Brandon Crawford 0.85 0.78 0.94 0.80 0 0.95 4.3
Nick Castellanos 0.95 0.94 0.97 0.47 0 0.97 4.3
Marcus Semien 0.94 0.96 0.80 0.81 0 0.77 4.3
Juan Soto 0.96 0.80 0.83 0.70 0 0.99 4.3
Luis Robert 0.84 0.79 0.83 0.81 0 1.00 4.3
Brandon Belt 0.97 0.99 0.89 0.56 0 0.85 4.3
Paul Goldschmidt 0.90 0.83 0.84 0.76 0 0.94 4.3
Rafael Devers 0.91 0.93 0.97 0.55 0 0.88 4.2
Manny Machado 0.85 0.79 0.95 0.78 0 0.88 4.2
Jorge Polanco 0.90 0.88 0.88 0.76 0 0.80 4.2
A.J. Pollock 0.64 0.87 0.94 0.82 0 0.95 4.2
George Springer 0.98 0.97 0.84 0.66 0 0.76 4.2
Mike Trout 0.93 0.91 0.67 0.69 0 1.00 4.2
Aaron Judge 0.83 0.96 0.88 0.60 0 0.92 4.2
Ozzie Albies 0.90 0.78 0.88 0.89 0 0.71 4.2
Matt Olson 0.89 0.94 0.94 0.50 0 0.83 4.1
Adam Engel 0.89 0.88 0.72 0.97 0 0.64 4.1
Thairo Estrada 0.85 0.89 0.96 0.55 0 0.85 4.1
Brandon Lowe 0.93 0.97 0.93 0.66 0 0.59 4.1
Avisail Garcia 0.73 0.92 0.96 0.72 0 0.74 4.1
Freddie Freeman 0.98 0.80 0.64 0.66 0 0.95 4.0
Yordan Alvarez 0.91 0.91 0.98 0.34 0 0.87 4.0
Tim Anderson 0.97 0.57 0.54 0.90 0 0.98 4.0
Jesse Winker 0.93 0.87 0.84 0.36 0 0.96 4.0
Kyle Schwarber 0.95 0.98 0.87 0.36 0 0.78 3.9
Chas McCormick 0.88 0.79 0.90 0.68 0 0.69 3.9
Jake Meyers 0.77 0.68 0.98 0.77 0 0.72 3.9

4 Creating Model File

4.1 Additional Data Prep

4.1.1 Remove Variables which are based off current hitting numbers

Not all variables can be used for predictive modeling. Variables that go into the percentile ranking or are non-normalized metrics created after the fact (such as WAR - Wins above Replacement or RE24) should be removed. However, metrics that are normalized by a per pitch basis (such as HR/FB%+) can remain as we expect batters to have similar performance in these metrics one year out.

#Creating a new dataset to keep original intact
df_batting_init3 = df_batting_init2

Lagged Percentile (_share) Variables can be used for predictive modeling. However since these variables were created for the Worth metric they must also be removed for modeling purposes.


#Order the dataset by lag columns
df_batting_init4 =  arrange(df_batting_init3, playerid,Season) #playerid is the Fangraph id assigned to each player

# Convert dataframe to data.table format
DT_batter2 = data.table(df_batting_init4)

#designate columns to lag - which is all of them
cols1 = (c('Runs_share','HR_share','RBI_share', 'SB_share','OPS_share','AVG_share','Worth'))
anscols = paste("lag", cols1, sep="_")
DT_batter2[, (anscols) := data.table::shift(.SD, 1, NA, "lag"),by ='playerid', .SDcols=cols1] #Create 1 period lags by year

#names(DT_batter2)

df_batting_final = as.data.frame(DT_batter2) %>% 
  select(-c(Runs_share,HR_share,RBI_share, SB_share,OPS_share,AVG_share))%>% 
  select(-(G:AVG),-(OBP:BABIP),-(wOBA:Dol),-(`wRC+`:REW),-(`WPA/LI`),-(wFB:wSF),-BsR,-(Def:wGDP),-(`wCH (pi)`:`wCH/C (pi)`),-(`AVG+`),-(`OBP+`:`BABIP+`),-(H_PA:PH_PA)) %>% select (-Name)

4.1.2 Creating Training/Test Split

We split the data into Training Data (which is used to create the model) and test data (which is used to validate the model)


set.seed(15674)  # For reproducibility
# Create index for testing and training data
inTrain <- createDataPartition(y = df_batting_final$Worth, p = 0.80, list = FALSE)
# subset pitching data for training
tr_2021 <- df_batting_final[inTrain,]
# subset the rest to test and validate trained model
te_2021 <- df_batting_final[-inTrain,]

nrow(tr_2021)/nrow(df_batting_final) #check if split is 0.8
[1] 0.8

4.1.3 Treat Missing Data by Imputing Mean Value

Vtreat Package in R is excellent for treating data before using for modeling. Additional documentation can be found here.
Note: The treatment plan also fixes variables names likeHR/FB%+ (which R doesn’t always handle the best) to HR_slash_FB_percent_plus_

treat_plan_2021 <- vtreat::designTreatmentsZ(
  dframe = tr_2021, # training data
  varlist = colnames(tr_2021) %>% .[. != "hitting_score1"], # input variables = all training data columns, except random
  codeRestriction = c("clean", "isBAD", "lev"), # derived variables types (drop cat_P)
  verbose = FALSE) # suppress messages

#clean stands for cleaned numerical variable, isBAD indicates that a value replacement has occurred (which indicates a missing value in this case), and lev is a binary indicator whether a particular value of that categorical variable was present.  

#### Checking Scoreframe

score_frame <- treat_plan_2021$scoreFrame %>% 
  select(varName, origName, code)

head(score_frame)


tr_treated_2021 <- vtreat::prepare(treat_plan_2021, tr_2021)
te_treated_2021 <- vtreat::prepare(treat_plan_2021, te_2021)


Total_dataset1_untreat = as.data.frame(DT_batter2) %>% select(-Name)

treat_plan_2021 <- vtreat::designTreatmentsZ(
  dframe = Total_dataset1_untreat, # training data
  varlist = colnames(Total_dataset1_untreat) %>% .[. != "hitting_score1"], # input variables = all training data columns, except random
  codeRestriction = c("clean", "isBAD", "lev"), # derived variables types (drop cat_P)
  verbose = FALSE) # suppress messages


total_treated_2021_hitting <- vtreat::prepare(treat_plan_2021, Total_dataset1_untreat)



#tr_treated = tr
#te_treated = te

dim(tr_treated_2021) #note there are dummies for each player and team
[1] 3424 1396

4.1.4 Check Distribution of Training Population

The population used for Training should be indicative of Total Population


ggplot2::qplot(tr_treated_2021$Worth, main="Training Set") + geom_histogram(colour="black", fill="limegreen") + theme_bw()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

skewness(tr_treated_2021$Worth) #The skewness is the same as the overall
[1] -0.24

5 Running XGboost Model

To keep things simple with modeling, we’ll turn the training data into simple input variables for caret::train, dropping the response variable and converting the data frame to a matrix. Documentation for this approach to XGboost can be found here.

5.1 Tuning the Model

5.1.1 Initial Non-Tuned Model

Break the data set into x and y inputs with x being a matrix

input_x <- as.matrix(((tr_treated_2021))%>%
   select(-Worth) %>%                      
   select(!ends_with ("_isBAD")))

input_y <- tr_treated_2021$Worth

XGBoost with Default Hyperparameters
The Variable Importance (caret::varImp(xgb_base_2021, scale = F )) from the caret package shows the contribution of each variable to the initial model. As you can see SLG_plus_ (SLG+) takes up much of the importance as it is derived from SLG (one of the key contributors to Worth). These types of variables will be removed during variable selection in the next step.
XGBoost documentation can be found for more general models here.


#Defaults for xgboost model
grid_default <- expand.grid(
  nrounds = 100,
  max_depth = 6,
  eta = 0.3,
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample = 1
)

#This is a blank train_control set, this will be updated after
train_control <- caret::trainControl(
  method = "none",
  verboseIter = FALSE, # no training log
  allowParallel = TRUE # FALSE for reproducible results 
)

xgb_base_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = train_control,
  tuneGrid = grid_default,
  method = "xgbTree",
  verbose = TRUE
)

caret::varImp(xgb_base_2021, scale = F  )
xgbTree variable importance

  only 20 most important variables shown (out of 764)

5.2 Further Variable Selection

5.2.1 Remove redundant and highly correlated variables

Selection Removal Step 1: Check for high correlations
Normally, this step is done early, but those steps were reserved for preparing the data


dep_cor1 <- t(as.data.frame(cor(tr_treated_2021[ , colnames(tr_treated_2021) != "Worth"],
                tr_treated_2021$Worth)))
dep_cor1 <-
as.data.frame(t(as.data.frame(dep_cor1)%>% 
  select(!starts_with("lag")) %>% #remove lag variables
  select(!contains("_isBAD")))) 

dep_cor1 <- tibble::rownames_to_column(dep_cor1,"VARIABLES")%>% #remove indicators for missing data
  filter(V1 > 0.70|V1 < -0.5)

dep_cor1

dep_cor2 <- colnames(row_to_names(t(dep_cor1),row_number = 1))

Let’s Remove variables with high correlation to worth metric (such as wFB/C)


input_x <- as.matrix(((tr_treated_2021))%>%
   select(-Worth) %>% #Remove dependent variable
     select (-all_of(dep_cor2) ) %>%      
select(!ends_with ("_isBAD"))) #indicator variable for missing data

input_y <- tr_treated_2021$Worth

Run the model on the new dataset to make sure the variable importances look fine


#Note Training parameters were set in initial model set up
xgb_base_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = train_control,
  tuneGrid = grid_default,
  method = "xgbTree",
  verbose = TRUE
)

caret::varImp(xgb_base_2021, scale = F  )
xgbTree variable importance

  only 20 most important variables shown (out of 763)

5.3 Model with new data

5.3.1 Tuning All Hyperparameters

A tune grid allows us to test a large amount of hyper-parameters and find the model with the lowest RMSE for predictions.
However, The more values you want to test and the greater the amount of Cross-Fold Validations (method = "cv"), the greater the computational time it will take. More information on the specific parameters can be found here.


# maximum number of trees
nrounds <- 1000

# note to start nrounds from 200, as smaller learning rates result in errors so
# big with lower starting points that they'll mess the scales
tune_grid <- expand.grid(
  nrounds = seq(from = 100, to = nrounds, by = 50),
  eta = c(0.01, 0.025, 0.05, 0.1),
  max_depth = c(2, 4, 6, 8),
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample = 1
)

tune_control <- caret::trainControl(
  method = "cv", # cross-validation
  number = 5, # with n folds 
  ## Note this was # out in the original code
  #index = createFolds(tr_treated$Id_clean), # fix the folds
  verboseIter = FALSE, # no training log
  allowParallel = TRUE # FALSE for reproducible results 
)

Running the initial tuning model

#Note I will be timing these runs to give an estimate on how long this model takes to run
start_time <- Sys.time()

xgb_tune_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = tune_control,
  tuneGrid = tune_grid,
  method = "xgbTree",
  verbose = FALSE
  ,verbosity = 0
)

end_time <- Sys.time()

end_time - start_time
Time difference of 12 mins

Tuning Plot and Variable Importance

varImp(xgb_tune_2021, scale = F  ) 
xgbTree variable importance

  only 20 most important variables shown (out of 763)
# helper function for the plots
tuneplot <- function(x, probs = .90) {
  ggplot(x) +
    coord_cartesian(ylim = c(quantile(x$results$RMSE, probs = probs), min(x$results$RMSE))) +
    theme_bw()
}

tuneplot(xgb_tune_2021)

5.3.2 Fine Tuning Model

5.3.2.1 Second Tuning: Maximum Depth and Minimum Child Weight

After fixing the learning rate to 0.1 and we’ll also set maximum depth to 3 +-1 (or +2 if max_depth == 2) to experiment a bit around the suggested best tune in previous step. Then, well fix maximum depth and minimum child weigh

tune_grid2 <- expand.grid(
  nrounds = seq(from = 50, to = nrounds, by = 50),
  eta = xgb_tune_2021$bestTune$eta,
  max_depth = ifelse(xgb_tune_2021$bestTune$max_depth == 2,
    c(xgb_tune_2021$bestTune$max_depth:4),
    xgb_tune_2021$bestTune$max_depth - 1:xgb_tune_2021$bestTune$max_depth + 1),
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = c(1, 2, 3),
  subsample = 1
)

xgb_tune2_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = tune_control,
  tuneGrid = tune_grid2,
  method = "xgbTree",
  verbose = TRUE
)

tuneplot(xgb_tune2_2021)


xgb_tune2_2021$bestTune

varImp(xgb_tune2_2021, scale = F  ) 
xgbTree variable importance

  only 20 most important variables shown (out of 763)

5.3.2.2 Third Tuning: Column and Row Sampling


tune_grid3 <- expand.grid(
  nrounds = seq(from = 50, to = nrounds, by = 50),
  eta = xgb_tune_2021$bestTune$eta,
  max_depth = xgb_tune2_2021$bestTune$max_depth,
  gamma = 0,
  colsample_bytree = c(0.4, 0.6, 0.8, 1.0),
  min_child_weight = xgb_tune2_2021$bestTune$min_child_weight,
  subsample = c(0.5, 0.75, 1.0)
)

xgb_tune3_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = tune_control,
  tuneGrid = tune_grid3,
  method = "xgbTree",
  verbose = TRUE
)

tuneplot(xgb_tune3_2021, probs = .95)


xgb_tune3_2021$bestTune

varImp(xgb_tune3_2021, scale = F  ) 
xgbTree variable importance

  only 20 most important variables shown (out of 763)

5.3.2.3 Fourth Tuning: Gamma

Next, we again pick the best values from previous step, and now will see whether changing the gamma has any effect on the model fit:

tune_grid4 <- expand.grid(
  nrounds = seq(from = 50, to = nrounds, by = 50),
  eta = xgb_tune_2021$bestTune$eta,
  max_depth = xgb_tune2_2021$bestTune$max_depth,
  gamma = c(0, 0.05,0.1, 0.2,0.4, 0.5, 0.7, 0.9, 1.0),
  colsample_bytree = xgb_tune3_2021$bestTune$colsample_bytree,
  min_child_weight = xgb_tune2_2021$bestTune$min_child_weight,
  subsample = xgb_tune3_2021$bestTune$subsample
)

xgb_tune4_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = tune_control,
  tuneGrid = tune_grid4,
  method = "xgbTree",
  verbose = TRUE
)

tuneplot(xgb_tune4_2021)
Warning: The shape palette can deal with a maximum of 6 discrete values because more than 6 becomes difficult
to discriminate; you have 9. Consider specifying shapes manually if you must have them.
Warning: Removed 60 rows containing missing values (geom_point).

xgb_tune4_2021$bestTune

varImp(xgb_tune4_2021, scale = F  ) 
xgbTree variable importance

  only 20 most important variables shown (out of 763)

5.3.2.4 Fifth Tuning: Reducing the Learning Rate

Now, we have tuned the hyperparameters and can start reducing the learning rate to get to the final model:

start_time <- Sys.time()

tune_grid5 <- expand.grid(
  nrounds = seq(from = 100, to = 10000, by = 75),
   eta = c(0.01, 0.015, 0.025,0.035, 0.05,0.75, 0.1),
  max_depth = xgb_tune2_2021$bestTune$max_depth,
  gamma = xgb_tune4_2021$bestTune$gamma,
  colsample_bytree = xgb_tune3_2021$bestTune$colsample_bytree,
  min_child_weight = xgb_tune2_2021$bestTune$min_child_weight,
  subsample = xgb_tune3_2021$bestTune$subsample
)



xgb_tune5_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = tune_control,
  tuneGrid = tune_grid5,
  method = "xgbTree",
  verbose = TRUE
)

#tuneplot(xgb_tune5_2021)

end_time <- Sys.time()

end_time - start_time
Time difference of 19 mins
xgb_tune5_2021$bestTune

varImp(xgb_tune5_2021, scale = F  ) 
xgbTree variable importance

  only 20 most important variables shown (out of 763)

5.3.2.5 Fitting Final Model


(final_grid_2021 <- expand.grid(
  nrounds = xgb_tune5_2021$bestTune$nrounds,
  eta = xgb_tune5_2021$bestTune$eta,
  max_depth = xgb_tune5_2021$bestTune$max_depth,
  gamma = xgb_tune5_2021$bestTune$gamma,
  colsample_bytree = xgb_tune5_2021$bestTune$colsample_bytree,
  min_child_weight = xgb_tune5_2021$bestTune$min_child_weight,
  subsample = xgb_tune5_2021$bestTune$subsample
))

(xgb_model_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = train_control,
  tuneGrid = final_grid_2021,
  method = "xgbTree",
  verbose = TRUE
))
eXtreme Gradient Boosting 

3424 samples
 763 predictor

No pre-processing
Resampling: None 
varImp(xgb_model_2021, scale = F  ) 
xgbTree variable importance

  only 20 most important variables shown (out of 763)

5.3.3 Model Performance

5.3.3.1 Checking Model on Test Split Data



y_pred_test <- predict(xgb_model_2021, data.matrix(te_treated_2021))

test_stats= cbind((te_treated_2021$Worth),y_pred_test)

test_statsR2 = cor(test_stats[,1],test_stats[,2])^2

print(test_statsR2)
[1] 0.89
y_pred_train <- predict(xgb_model_2021, data.matrix(tr_treated_2021))

train_stats = cbind((tr_treated_2021$Worth),y_pred_train)

train_statsR2 = cor(train_stats[,1],train_stats[,2])^2

print(train_statsR2)
[1] 0.96
#test dataset
x <- select(te_treated_2021, -Worth)
y <- (te_treated_2021$Worth)

(xgb_model_rmse <- ModelMetrics::rmse(y, predict(xgb_model_2021, newdata = x)))
[1] 0.34
holdout_x <- select(tr_treated_2021, -Worth)
holdout_y <- tr_treated_2021$Worth

(xgb_model_rmse <- ModelMetrics::rmse(holdout_y, predict(xgb_model_2021, newdata = holdout_x)))
[1] 0.21

5.3.3.2 Graphical Representation of Model


ggplot2::ggplot() +
  aes(x = test_stats[,1], y = test_stats[,2]) +
  geom_jitter() +
  xlab("Predicted Values") +
  ylab("Actual Values") +
  ggtitle("Results of Hitting Model on Test Data")+
  theme(plot.title = element_text(hjust = 0.5,size = 22,color ="steel blue"))+
  geom_smooth(method = "lm")
`geom_smooth()` using formula 'y ~ x'

6 Creating 2022 Projections from Model

6.1 Re-fit model for Important Variables

Now that we have an acceptable model, we can use it to create projections for how well we think players should do in 2022 based on their hitting statistics in 2021. First let’s reduce

  1. Only keep variables with high enough importance in model


vip(xgb_model_2021, num_features = 30)  # 10 is the default, 30 gives a visual on the top 30 most important features of the model


unscalevi = vi(xgb_model_2021, method="model") #shows the numbers behind the plot

unscalevi$Importance_perc = with(unscalevi,Importance/sum(Importance)) #adds percentages 

unscalevi # importance by variables

variables_to_keep_2021 = subset(unscalevi, Importance_perc > 0.0010) %>% select(Variable) #Keep Variables that explain at least a small amount [0.1%] of the model. This is a low threshold for inclusion ,but you can adjust this

variables_to_keep_2021b = t(variables_to_keep_2021)

variables_to_keep_2022 = colnames(row_to_names(variables_to_keep_2021b,row_number = 1))

tr_treated_2022 = tr_treated_2021 %>%  select(Worth,one_of(variables_to_keep_2022),starts_with("Team_lev_x_")) #keep modeled important variables along with team indicator variables

te_treated_2022 = te_treated_2021 %>%  select(Worth,one_of(variables_to_keep_2022),starts_with("Team_lev_x_"))

input_x_2022 = as.matrix(select(tr_treated_2022, -Worth))

input_y_2022 = tr_treated_2022$Worth
  1. Re-fit model with reduced variable scope


(final_grid_2021 <- expand.grid(
  nrounds = xgb_tune5_2021$bestTune$nrounds,
  eta = xgb_tune5_2021$bestTune$eta,
  max_depth = xgb_tune5_2021$bestTune$max_depth,
  gamma = xgb_tune5_2021$bestTune$gamma,
  colsample_bytree = xgb_tune5_2021$bestTune$colsample_bytree,
  min_child_weight = xgb_tune5_2021$bestTune$min_child_weight,
  subsample = xgb_tune5_2021$bestTune$subsample
))

(xgb_model_2022 <- caret::train(
  x = input_x_2022,
  y = input_y_2022,
  trControl = train_control,
  tuneGrid = final_grid_2021,
  method = "xgbTree",
  verbose = TRUE
))
eXtreme Gradient Boosting 

3424 samples
  76 predictor

No pre-processing
Resampling: None 
vip(xgb_model_2022, num_features = 30)


unscalevi24 = vi(xgb_model_2022, method="model")

unscalevi24$Importance_perc = with(unscalevi24,Importance/sum(Importance)) 

unscalevi24

save(xgb_model_2022,file = '2022_Hitting5x5_Model.Rdata')

hitting5x5 = xgb_model_2022

hittinginput = input_x_2022

#For anything above breaking_IP we need to create projection table by age or age bucket

#write_csv(unscalevi24,"unscalevi24.csv")
# 2022 Projections Full
First let’s prepare a file for predicting based on our model object
```r
variableslag5xb= row_to_names(as.data.frame(t(variables_to_keep_2022)),row_number = 1) %>% select (starts_with(“lag”))
variables_nolag5xb = (owmr::remove_prefix(variableslag5xb,“lag” , sep = “_“))
Data_Predict_2022a5xb = total_treated_2021_hitting %>% select (one_of(colnames(variables_nolag5xb)),Season,playerid)
colnames(Data_Predict_2022a5xb) <- paste0(“lag_”, colnames(Data_Predict_2022a5xb))
Data_Predict_2022b5xb = total_treated_2021_hitting %>% select (one_of(colnames(variables_nolag5xb))) colnames(Data_Predict_2022b5xb) = colnames(variableslag5xb)
variables_to_keep_2022_nolag5xb = total_treated_2021_hitting %>% select(one_of(variables_to_keep_2022),Season,playerid,starts_with(“Team_lev_x_”))%>% select(-one_of(colnames(Data_Predict_2022b5xb)))
Data_predict_20225xb = sqldf( ” select a.,b. from Data_Predict_2022a5xb a, variables_to_keep_2022_nolag5xb b on b.playerid = a.lag_playerid and b.Season = a.lag_Season ” ) %>% select(-lag_playerid,lag_Season) %>% filter(Season == 2021) %>% select(one_of(variables_to_keep_2022),starts_with(“Team_lev_x_”))
```

6.2 Create Predictions for Model

6.2.1 Run Projections on Players who Played in 2021

This is the raw prediction score per IP for each pitcher


hitting_predictions5xb = as.data.frame(predict(xgb_model_2022,Data_predict_20225xb))

names(hitting_predictions5xb) = c("Predict_Score")

Data_predict_2022_w_hitting_Predictions5xb = cbind(Data_predict_20225xb,hitting_predictions5xb) %>% select(playerid,Predict_Score)

head(Data_predict_2022_w_hitting_Predictions5xb)
NA


Latest_2022_hittingdata_FP = read_csv("FanGraph_Fantasy_Baseball_Hitting.csv")
Rows: 625 Columns: 28
-- Column specification -------------------------------------------------------------------------------------------
Delimiter: ","
chr  (3): Name, Team, playerid
dbl (25): G, PA, AB, H, 2B, 3B, HR, R, RBI, BB, SO, HBP, SB, CS, AVG, OBP, SLG, OPS, wOBA, wRC+, WAR, ADP, Inte...

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Latest_2022_hittingdata_FP
NA
NA



hitting_Data_NonAdj_Projections5xb = sqldf(
  "
  select a.*,b.Predict_Score
  from Latest_2022_hittingdata_FP a 
  left join 
  Data_predict_2022_w_hitting_Predictions5xb b
  on a.playerid = b.playerid
  "
) %>% filter(ADP<370 | is.na(Predict_Score)==F)


hitting_Data_Adj_Projections5xb =
hitting_Data_NonAdj_Projections5xb %>% 
  mutate(
    Avg_PA = 300,
    AdjPredict_Score_raw = ifelse(is.na(Predict_Score),NA,Predict_Score*(PA/Avg_PA)),
    max_predscore= max(AdjPredict_Score_raw,na.rm = T),
    AdjPredict_Score = ifelse (is.na(AdjPredict_Score_raw),NA,AdjPredict_Score_raw *100/max_predscore),
    WAR_rank = order(order(rank(WAR,ties.method = 'average'),decreasing = TRUE)),
    AdjPredict_Score_Rank = order(order(rank(AdjPredict_Score,ties.method = 'average'),decreasing = TRUE))-sum(is.na(AdjPredict_Score)),
        Ranks_Above_ADP = ADP - AdjPredict_Score_Rank
  ) %>% select (Name,ADP,WAR, WAR_rank,AdjPredict_Score ,AdjPredict_Score_Rank,Ranks_Above_ADP)


  

ggplot2::qplot(hitting_Data_Adj_Projections5xb$AdjPredict_Score, main="Predictions") + geom_histogram(colour="black", fill="grey") + theme_bw()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


7 2022 Projections Full

7.1 Table of hitting Projections (Players who Didn’t Play in 2021 - Recieve an NA)

AdjPredict_Score are normalized to 100


tableexport =
hitting_Data_Adj_Projections5xb %>%
  arrange (ADP,WAR) %>% 
  kbl() %>% 
 kable_material(c("striped", "hover","condensed","responsive"),full_width = F,fixed_thead = T)

save_kable(tableexport,file = "hitting5x5.html")

#tableexport

This is a better formatted Table



ft_dt <- hitting_Data_Adj_Projections5xb[1:nrow(hitting_Data_Adj_Projections5xb), 1:ncol(hitting_Data_Adj_Projections5xb)] %>% 
  filter(AdjPredict_Score_Rank>0)%>%  arrange((AdjPredict_Score_Rank))

ft_dt$ADP <- color_tile("white", "red")(ft_dt$ADP)

ft_dt$WAR <- color_bar("lightblue")(ft_dt$WAR)

ft_dt$AdjPredict_Score<- color_bar("lightblue")(ft_dt$AdjPredict_Score)

ft_dt$WAR_Rank <- color_tile("green","orange")(ft_dt$WAR_rank)

ft_dt$Predict_Rank <- color_tile("green","orange")(ft_dt$AdjPredict_Score_Rank) 


ft_dt$Ranks_Above_ADP <- 
  ifelse(
  ft_dt$Ranks_Above_ADP < 0,
  cell_spec(round(ft_dt$Ranks_Above_ADP,2), color = "red", italic = T),
  cell_spec(round(ft_dt$Ranks_Above_ADP,2), color = "green", italic = T)
)


ft_dt2 <- ft_dt[c("Name", "ADP", "WAR", "AdjPredict_Score", "WAR_Rank","Predict_Rank","Ranks_Above_ADP")]



table_export = 
kbl(ft_dt2, escape = F) %>% 
 kable_material(c("striped", "hover","condensed","responsive"),full_width = F,fixed_thead = T) %>%   column_spec(6, width = "3cm") %>%
  add_header_above(c(" ", "Scores" = 3, "Ranks" = 2," "))

save_kable(table_export,file = "Hitting5x5_updated.html")
  
table_export  
Scores
Ranks
Name ADP WAR AdjPredict_Score WAR_Rank Predict_Rank Ranks_Above_ADP
Vladimir Guerrero Jr. 5.6 5.6 100.00 5 1 4.6
Bryce Harper 9.1 5.2 99.88 8 2 7.1
Bo Bichette 4.9 4.7 99.41 13 3 1.9
Marcus Semien 39.0 4.7 99.38 14 4 35
Luis Robert 16.2 4.6 97.61 16 5 11.2
Trea Turner 1.4 5.5 97.41 6 6 -4.6
Rafael Devers 14.5 4.8 96.67 11 7 7.5
Nick Castellanos 60.6 2.5 95.83 81 8 52.6
Freddie Freeman 19.2 5.1 95.71 9 9 10.2
Shohei Ohtani 9.2 3.9 94.98 34 10 -0.8
José Ramírez 3.4 5.9 94.93 3 11 -7.6
Teoscar Hernández 27.8 2.3 94.24 98 12 15.8
Juan Soto 3.9 7.6 94.00 1 13 -9.1
Cedric Mullins 32.8 3.3 94.00 52 14 18.8
Kyle Tucker 12.7 4.6 93.76 15 15 -2.3
Paul Goldschmidt 47.6 3.5 92.99 44 16 31.6
Jorge Polanco 77.8 3.1 90.47 58 17 60.8
Mookie Betts 15.4 5.8 89.60 4 18 -2.6
Jose Altuve 73.3 3.8 89.15 36 19 54.3
Tim Anderson 30.9 3.2 89.03 55 20 10.9
Mike Trout 13.0 6.6 88.43 2 21 -8
Manny Machado 22.1 4.5 88.38 20 22 0.1
Matt Olson 39.0 4.2 87.89 28 23 16
Aaron Judge 36.0 5.3 87.77 7 24 12
George Springer 57.8 4.3 87.01 26 25 32.8
Pete Alonso 46.2 3.5 86.25 43 26 20.2
Tyler O'Neill 47.1 2.8 86.06 67 27 20.1
Bryan Reynolds 93.0 4.2 85.88 30 28 65
Byron Buxton 48.0 4.4 85.84 23 29 19
Xander Bogaerts 45.8 4.5 84.61 19 30 15.8
Brandon Lowe 75.4 4.3 84.56 24 31 44.4
Ozzie Albies 17.6 4.1 83.75 32 32 -14.4
Ronald Acuña Jr. 10.4 5.0 83.51 10 33 -22.6
Jonathan India 90.7 3.8 83.07 35 34 56.7
José Abreu 78.4 2.1 82.26 114 35 43.4
Trevor Story 36.8 4.2 82.16 29 36 0.8
Javier Báez 62.5 2.8 82.11 68 37 25.5
Frank Schwindel 233.8 0.8 81.23 262 38 195.8
Starling Marte 31.2 2.4 80.74 84 39 -7.8
Yordan Alvarez 26.3 4.1 80.67 31 40 -13.7
Austin Riley 51.8 3.1 80.32 59 41 10.8
Mitch Haniger 109.0 2.4 80.11 89 42 67
Kris Bryant 84.9 2.4 79.90 86 43 41.9
Corey Seager 78.3 4.5 79.66 22 44 34.3
Jared Walsh 119.8 2.2 79.24 101 45 74.8
Ketel Marte 77.2 3.7 79.16 37 46 31.2
Jesse Winker 111.8 3.4 79.11 45 47 64.8
Joey Votto 141.9 1.9 78.73 130 48 93.9
Dansby Swanson 119.1 2.7 78.12 72 49 70.1
Carlos Correa 100.9 4.7 78.00 12 50 50.9
Brandon Crawford 219.2 2.7 77.49 70 51 168.2
Randy Arozarena 65.4 2.1 77.17 113 52 13.4
Wander Franco 58.9 4.5 77.16 21 53 5.9
Kyle Schwarber 114.5 2.7 77.12 71 54 60.5
Salvador Perez 32.2 3.7 76.97 38 55 -22.8
Nolan Arenado 67.7 3.7 76.69 39 56 11.7
Avisaíl García 170.2 2.3 76.66 93 57 113.2
J.D. Martinez 88.9 2.0 76.37 116 58 30.9
Rhys Hoskins 126.3 3.0 75.91 61 59 67.3
Franmil Reyes 120.9 1.8 74.37 132 60 60.9
Brandon Belt 226.3 2.5 73.93 80 61 165.3
Ryan Mountcastle 119.7 1.6 73.85 161 62 57.7
Willy Adames 123.2 3.3 73.61 49 63 60.2
J.T. Realmuto 48.8 3.6 73.05 40 64 -15.2
C.J. Cron 124.7 1.7 72.21 154 65 59.7
Francisco Lindor 53.5 4.2 72.18 27 66 -12.5
Hunter Renfroe 161.4 1.7 71.37 153 67 94.4
Max Muncy 134.9 3.2 71.16 54 68 66.9
Jake Cronenworth 127.1 3.4 70.56 48 69 58.1
Jazz Chisholm Jr. 76.0 1.8 70.47 138 70 6
Josh Bell 127.6 1.9 70.46 129 71 56.6
Justin Turner 152.7 3.0 69.78 62 72 80.7
Nelson Cruz 171.9 1.4 69.41 181 73 98.9
Joey Gallo 177.0 3.5 68.97 42 74 103
Whit Merrifield 32.5 2.3 68.82 97 75 -42.5
Eduardo Escobar 200.5 1.9 68.03 131 76 124.5
Kolten Wong 190.8 2.5 66.98 79 77 113.8
Lourdes Gurriel Jr. 145.2 1.4 66.53 187 78 67.2
Anthony Rizzo 193.6 2.1 66.08 109 79 114.6
Chris Taylor 139.9 2.3 65.76 95 80 59.9
Giancarlo Stanton 97.7 2.3 65.62 99 81 16.7
Ryan McMahon 161.2 2.0 65.58 119 82 79.2
Alex Verdugo 168.6 2.4 65.55 85 83 85.6
Will Smith 53.2 4.6 65.26 17 84 -30.8
Luis Urías 173.3 2.5 65.19 75 85 88.3
Mark Canha 283.4 1.8 65.16 141 86 197.4
Austin Meadows 145.5 1.5 65.05 168 87 58.5
Jean Segura 202.3 2.4 64.95 91 88 114.3
Adolis García 166.0 1.3 64.94 201 89 77
Mike Yastrzemski 286.5 2.2 64.74 108 90 196.5
Austin Hays 243.2 1.7 64.28 150 91 152.2
Ian Happ 213.2 2.2 63.86 104 92 121.2
Tommy Edman 90.2 2.3 63.28 94 93 -2.8
Max Kepler 294.6 2.5 63.24 76 94 200.6
AJ Pollock 218.6 1.8 63.02 136 95 123.6
Jonathan Schoop 213.2 1.7 62.96 145 96 117.2
Miguel Sanó 280.4 0.8 62.56 271 97 183.4
Yasmani Grandal 101.1 4.0 62.55 33 98 3.1
Dylan Carlson 172.0 2.5 62.54 77 99 73
Tyler Naquin 352.7 0.7 61.98 274 100 252.7
Nathaniel Lowe 251.7 1.8 61.88 133 101 150.7
Ramón Laureano 247.0 2.6 61.76 74 102 145
Enrique Hernández 226.7 2.8 61.48 66 103 123.7
Trent Grisham 143.5 3.1 61.47 57 104 39.5
Yoán Moncada 158.7 3.4 61.27 47 105 53.7
Jeimer Candelario 239.6 3.2 61.21 56 106 133.6
Ty France 154.6 2.4 61.21 88 107 47.6
Brendan Rodgers 169.9 1.6 61.10 160 108 61.9
LaMonte Wade Jr. 343.8 1.0 60.95 233 109 234.8
Adam Duvall 212.8 1.7 60.72 148 110 102.8
Amed Rosario 170.7 2.0 60.56 118 111 59.7
Akil Baddoo 156.9 1.6 60.33 158 112 44.9
Jesús Sánchez 226.3 2.0 60.19 115 113 113.3
Andrew Benintendi 192.7 1.9 59.98 125 114 78.7
Jorge Soler 192.2 1.7 59.05 147 115 77.2
Charlie Blackmon 236.7 0.6 58.78 301 116 120.7
Adalberto Mondesi 57.3 2.0 58.71 123 117 -59.7
Bobby Dalbec 229.2 1.6 58.46 157 118 111.2
Christian Yelich 97.2 2.9 58.33 64 119 -21.8
Trey Mancini 205.0 1.4 57.55 180 120 85
Eloy Jiménez 65.1 2.2 57.36 105 121 -55.9
Adam Frazier 400.3 2.0 57.26 117 122 278.3
Yuli Gurriel 210.6 1.4 57.17 189 123 87.6
Robbie Grossman 191.3 1.7 57.15 149 124 67.3
Patrick Wisdom 358.3 1.2 56.75 214 125 233.3
Willson Contreras 109.9 2.7 56.64 69 126 -16.1
Andrew McCutchen 348.9 0.9 56.56 254 127 221.9
Matt Chapman 181.5 3.6 56.49 41 128 53.5
Alex Bregman 88.3 4.5 56.21 18 129 -40.7
Harrison Bader 235.8 2.9 56.03 63 130 105.8
Eddie Rosario 181.7 0.9 55.86 251 131 50.7
Wil Myers 288.9 1.0 55.59 231 132 156.9
Josh Donaldson 187.1 3.3 55.07 51 133 54.1
Brandon Nimmo 324.3 3.2 54.97 53 134 190.3
Nicky Lopez 250.4 1.8 54.62 140 135 115.4
César Hernández 403.0 1.7 54.18 152 136 267
Seth Brown 502.2 0.6 53.86 293 137 365.2
Fernando Tatis Jr. 27.5 3.4 53.75 46 138 -110.5
Rafael Ortega 337.3 1.3 53.19 193 139 198.3
Miguel Rojas 475.9 2.3 52.60 96 140 335.9
Josh Rojas 247.6 1.5 52.36 167 141 106.6
Jo Adell 228.4 0.6 52.31 300 142 86.4
Joey Wendle 403.7 1.3 52.24 194 143 260.7
Eugenio Suárez 198.5 2.2 51.30 107 144 54.5
Michael Brantley 251.4 2.0 51.19 122 145 106.4
Tony Kemp 459.8 1.6 50.89 155 146 313.8
Alex Kirilloff 177.6 1.5 50.70 165 147 30.6
Daulton Varsho 93.3 2.1 50.36 110 148 -54.7
Anthony Santander 298.7 1.3 49.90 197 149 149.7
Tyler Stephenson 138.3 2.9 49.84 65 150 -11.7
Tommy Pham 274.8 1.9 49.83 126 151 123.8
DJ LeMahieu 119.2 2.4 49.39 87 152 -32.8
Ke'Bryan Hayes 142.6 3.3 49.38 50 153 -10.4
Evan Longoria 455.5 1.5 49.25 174 154 301.5
Gleyber Torres 154.9 2.5 48.80 78 155 -0.1
Jonathan Villar 253.1 1.1 48.75 225 156 97.1
Andrew Vaughn 250.3 1.2 48.73 207 157 93.3
Raimel Tapia 273.9 0.1 48.60 384 158 115.9
Randal Grichuk 419.6 0.6 48.42 292 159 260.6
Michael Conforto 207.5 2.7 48.19 73 160 47.5
Isiah Kiner-Falefa 296.8 1.6 48.16 163 161 135.8
Mike Zunino 253.5 2.0 47.69 124 162 91.5
J.P. Crawford 452.2 2.5 46.58 83 163 289.2
Myles Straw 127.0 2.3 46.58 100 164 -37
Mitch Garver 166.1 2.4 46.28 90 165 1.1
David Fletcher 417.0 1.3 46.27 191 166 251
Jesús Aguilar 300.5 1.2 46.26 215 167 133.5
Odúbel Herrera 583.2 1.1 45.65 228 168 415.2
Luke Voit 253.5 1.3 45.65 199 169 84.5
Joc Pederson 480.8 0.8 45.62 270 170 310.8
Lorenzo Cain 451.3 1.5 45.51 176 171 280.3
Paul DeJong 491.4 2.1 45.43 112 172 319.4
Eric Hosmer 470.1 0.4 45.09 329 173 297.1
Jake Meyers 593.2 1.1 45.06 217 174 419.2
Luis Arraez 351.3 1.5 44.88 177 175 176.3
Gavin Sheets 525.8 0.8 44.21 265 176 349.8
Jarred Kelenic 141.5 1.5 44.10 169 177 -35.5
Brad Miller 551.8 1.4 44.07 179 178 373.8
Nick Madrigal 342.1 1.8 43.69 139 179 163.1
Manuel Margot 416.6 1.1 43.57 229 180 236.6
Sam Hilliard 477.8 0.5 43.54 309 181 296.8
David Peralta 518.1 1.1 43.01 221 182 336.1
Wilmer Flores 472.5 1.2 42.95 208 183 289.5
Alcides Escobar 586.4 0.2 42.92 374 184 402.4
Lane Thomas 269.2 1.1 42.87 226 185 84.2
Pavin Smith 476.9 0.6 42.59 286 186 290.9
Bobby Bradley 491.0 0.5 42.42 308 187 304
Darin Ruf 454.5 0.9 42.25 247 188 266.5
Connor Joe 341.2 0.8 42.23 255 189 152.2
Jeff McNeil 323.0 1.9 42.18 128 190 133
Ramón Urías 529.7 1.6 41.95 162 191 338.7
Adam Engel 597.3 0.6 41.80 290 192 405.3
Gio Urshela 306.8 1.3 41.80 192 193 113.8
Ben Gamel 579.0 0.9 41.63 242 194 385
José Iglesias 547.7 1.2 41.53 210 195 352.7
Josh Harrison 412.3 1.1 41.53 224 196 216.3
Yandy Díaz 464.5 1.6 41.40 156 197 267.5
Cody Bellinger 98.9 3.1 41.03 60 198 -99.1
Tyrone Taylor 526.6 1.1 41.03 222 199 327.6
Danny Jansen 403.1 1.9 40.25 127 200 203.1
Kyle Farmer 528.0 1.1 40.14 220 201 327
Alec Bohm 311.6 1.4 40.08 188 202 109.6
Michael A. Taylor 516.5 1.2 40.06 212 203 313.5
Hunter Dozier 443.8 0.6 39.85 288 204 239.8
Garrett Cooper 527.1 1.0 39.51 238 205 322.1
Carlos Santana 511.8 0.7 39.08 272 206 305.8
Chas McCormick 495.8 1.1 39.04 218 207 288.8
Elias Díaz 229.7 1.3 38.97 198 208 21.7
Kyle Lewis 399.0 1.5 38.79 170 209 190
Christian Walker 506.6 0.7 38.58 276 210 296.6
Steven Duggar 597.8 0.7 38.56 275 211 386.8
Brian Anderson 495.9 1.8 38.53 137 212 283.9
Ji-Man Choi 577.3 0.7 38.52 282 213 364.3
Kevin Kiermaier 588.0 1.2 38.43 203 214 374
Andrés Giménez 282.5 1.7 38.34 146 215 67.5
Chad Pinder 570.3 0.6 38.01 291 216 354.3
Eric Haase 302.9 0.7 37.96 278 217 85.9
Max Stassi 300.6 1.8 37.64 135 218 82.6
Jake Fraley 547.6 0.9 37.51 241 219 328.6
Sean Murphy 252.4 2.5 37.39 82 220 32.4
Yadiel Hernandez 583.5 0.1 37.37 400 221 362.5
Bradley Zimmer 505.3 0.7 36.89 280 222 283.3
Gavin Lux 266.4 1.5 36.79 178 223 43.4
Tommy La Stella 561.3 1.2 36.41 213 224 337.3
Rowdy Tellez 342.4 0.8 36.24 268 225 117.4
Nick Ahmed 581.5 1.1 36.05 223 226 355.5
Marcell Ozuna 174.3 2.2 35.95 103 227 -52.7
Leury García 536.2 0.7 35.69 281 228 308.2
Garrett Hampson 299.6 0.3 35.60 340 229 70.6
Gary Sánchez 245.3 1.3 35.50 196 230 15.3
Christian Vázquez 201.1 1.7 35.34 142 231 -29.9
Edmundo Sosa 501.9 1.5 35.19 172 232 269.9
Anthony Rendon 110.2 4.3 35.15 25 233 -122.8
Keibert Ruiz 145.4 2.2 35.06 106 234 -88.6
Austin Slater 575.2 1.0 34.51 232 235 340.2
Bryan De La Cruz 534.7 0.9 34.25 240 236 298.7
Didi Gregorius 513.0 0.9 34.20 252 237 276
Abraham Toro 323.7 1.4 34.19 190 238 85.7
Brandon Marsh 418.4 1.2 34.01 209 239 179.4
Ha-Seong Kim 349.8 1.5 33.80 173 240 109.8
Luis García 556.8 0.7 33.71 277 241 315.8
Colin Moran 582.8 -0.1 33.51 494 242 340.8
Anthony Alford 555.4 0.0 33.46 468 243 312.4
Yadier Molina 322.4 1.2 33.36 211 244 78.4
Cavan Biggio 319.3 1.4 33.14 184 245 74.3
Justin Upton 565.1 0.3 32.89 349 246 319.1
Jurickson Profar 540.1 0.3 32.83 341 247 293.1
Andy Ibáñez 563.1 1.5 32.79 175 248 315.1
Miguel Cabrera 539.9 -0.4 32.78 500 249 290.9
Willie Calhoun 451.5 0.2 32.74 359 250 201.5
Kevin Pillar 590.7 0.4 32.61 314 251 339.7
Elvis Andrus 566.4 0.9 32.57 248 252 314.4
Kyle Isbel 542.5 0.6 32.48 299 253 289.5
Kole Calhoun 522.9 1.1 32.47 227 254 268.9
Yoshi Tsutsugo 431.9 0.3 32.33 336 255 176.9
Carson Kelly 280.7 2.0 32.29 120 256 24.7
Michael Chavis 569.3 0.0 31.86 477 257 312.3
Nick Solak 494.8 0.7 31.77 284 258 236.8
Josh Naylor 587.4 0.5 31.66 306 259 328.4
Jordan Luplow 600.6 0.8 31.43 263 260 340.6
Nick Senzel 422.0 1.0 31.42 234 261 161
Alejandro Kirk 237.1 1.7 31.14 144 262 -24.9
Omar Narváez 251.8 1.7 31.04 143 263 -11.2
Nico Hoerner 506.9 1.4 30.98 182 264 242.9
Ryan Jeffers 480.5 1.6 30.97 159 265 215.5
Corey Dickerson 581.1 0.1 30.57 388 266 315.1
Dominic Smith 456.4 0.5 30.47 304 267 189.4
Santiago Espinal 557.5 0.7 30.41 273 268 289.5
Victor Robles 474.9 1.0 30.29 237 269 205.9
Jorge Mateo 472.9 0.5 30.25 312 270 202.9
Jed Lowrie 597.2 0.8 30.16 267 271 326.2
Oscar Mercado 593.3 0.7 29.78 285 272 321.3
Mike Moustakas 374.8 0.9 29.69 245 273 101.8
Travis d'Arnaud 224.4 1.4 29.59 183 274 -49.6
Aristides Aquino 585.4 0.4 29.59 333 275 310.4
Yan Gomes 442.1 1.5 29.49 166 276 166.1
Lars Nootbaar 572.6 0.2 29.24 375 277 295.6
Rougned Odor 504.9 0.6 29.18 289 278 226.9
Thairo Estrada 600.3 0.8 29.16 259 279 321.3
Luis Torrens 446.4 0.7 27.97 279 280 166.4
Kevin Newman 597.1 0.5 27.93 303 281 316.1
J.D. Davis 511.6 0.9 27.78 253 282 229.6
Alex Dickerson 597.9 0.2 27.55 357 283 314.9
Niko Goodrum 571.4 0.7 27.34 283 284 287.4
James McCann 351.5 1.4 27.26 186 285 66.5
Jake McCarthy 600.8 0.0 26.71 429 286 314.8
Roberto Pérez 545.7 1.3 26.49 195 287 258.7
Victor Reyes 592.8 0.4 26.39 316 288 304.8
Jacob Stallings 421.9 2.2 26.38 102 289 132.9
Jason Heyward 594.8 0.8 26.23 269 290 304.8
Aledmys Díaz 573.3 0.9 26.14 243 291 282.3
Stephen Piscotty 594.4 -0.2 25.66 497 292 302.4
Brett Gardner 598.6 0.6 25.65 297 293 305.6
Jorge Alfaro 454.9 0.0 25.42 474 294 160.9
Francisco Mejía 492.9 1.3 25.33 202 295 197.9
Jarren Duran 497.2 0.4 24.88 327 296 201.2
Martín Maldonado 562.4 1.2 24.43 216 297 265.4
Donovan Solano 594.0 0.4 24.18 324 298 296
Daniel Vogelbach 586.6 0.6 23.98 298 299 287.6
Austin Nola 377.8 1.5 23.41 171 300 77.8
Aaron Hicks 524.7 1.6 23.21 164 301 223.7
Brett Phillips 590.7 0.3 23.18 337 302 288.7
Tucker Barnhart 471.5 1.3 22.87 200 303 168.5
Cole Tucker 586.6 0.4 22.32 325 304 282.6
Tyler Wade 505.0 0.4 22.32 331 305 200
Manny Piña 570.7 1.2 22.05 205 306 264.7
Yu Chang 596.6 0.4 21.85 334 307 289.6
DJ Stewart 600.3 0.5 21.81 305 308 292.3
Matt Vierling 578.6 0.5 21.71 313 309 269.6
Austin Hedges 587.0 0.5 21.43 307 310 277
Keston Hiura 465.4 0.2 21.21 364 311 154.4
Trevor Larnach 565.1 0.1 21.12 390 312 253.1
Kyle Higashioka 555.9 1.0 21.04 236 313 242.9
Josh VanMeter 597.6 0.4 20.98 323 314 283.6
Kelvin Gutierrez 597.6 0.3 20.96 351 315 282.6
Dom Nuñez 586.9 0.5 20.74 311 316 270.9
Jose Barrero 478.7 1.2 20.72 206 317 161.7
Carter Kieboom 539.8 0.8 20.67 258 318 221.8
Lewin Díaz 585.3 0.3 20.62 344 319 266.3
Mitch Moreland 600.9 0.3 20.58 342 320 280.9
Clint Frazier 472.5 0.3 20.27 345 321 151.5
Taylor Walls 587.7 1.1 19.70 219 322 265.7
Matt Duffy 586.2 0.5 19.68 310 323 263.2
Jace Peterson 579.4 0.2 19.64 361 324 255.4
Jackie Bradley Jr. 589.7 0.2 19.56 379 325 264.7
Pedro Severino 522.6 0.9 19.13 244 326 196.6
Brent Rooker 597.9 0.3 18.49 348 327 270.9
Tom Murphy 539.9 1.2 18.38 204 328 211.9
Mike Brosseau 597.1 0.4 18.32 320 329 268.1
Matt Beaty 591.6 0.2 18.14 358 330 261.6
Andrew Stevenson 597.8 -0.1 18.02 482 331 266.8
Robinson Chirinos 595.5 0.6 17.93 294 332 263.5
Brian Goodwin 600.6 0.1 17.51 399 333 267.6
Asdrúbal Cabrera 599.6 0.0 17.41 425 334 265.6
Dylan Moore 509.0 0.4 17.40 326 335 174
Riley Adams 593.0 0.8 17.34 257 336 257
Sheldon Neuse 999.0 0.2 17.28 362 337 662
Albert Pujols 597.5 -0.3 17.28 499 338 259.5
Andrelton Simmons 586.5 1.1 17.09 230 339 247.5
Ehire Adrianza 999.0 0.1 16.32 403 340 659
Max Schrock 599.8 0.0 16.21 467 341 258.8
Cristian Pache 573.0 0.8 15.96 266 342 231
Christian Arroyo 594.0 0.2 15.92 354 343 251
Derek Hill 999.0 0.1 15.88 387 344 655
Austin Barnes 592.5 0.8 15.79 264 345 247.5
Jason Castro 599.2 0.8 15.55 256 346 253.2
Nick Gordon 584.3 0.4 15.49 330 347 237.3
Daz Cameron 599.3 0.2 15.45 378 348 251.3
Harold Castro 599.7 -0.1 15.33 492 349 250.7
Yonathan Daza 598.7 -0.1 15.30 480 350 248.7
Danny Santana 999.0 0.1 15.23 406 351 648
Adam Eaton 999.0 0.2 15.22 355 352 647
Jonah Heim 531.7 0.9 14.96 246 353 178.7
Curt Casali 592.1 0.9 14.65 250 354 238.1
Victor Caratini 582.5 0.8 14.63 260 355 227.5
Edward Olivares 584.8 0.3 14.07 346 356 228.8
Tomás Nido 601.0 0.2 13.81 360 357 244
Ben Rortvedt 999.0 1.0 13.72 235 358 641
Leody Taveras 573.9 0.6 13.61 287 359 214.9
Jake Marisnick 999.0 0.2 13.39 381 360 639
Khris Davis 999.0 0.0 13.23 448 361 638
Kevin Plawecki 581.7 0.8 13.07 261 362 219.7
Nomar Mazara 600.4 0.1 13.05 416 363 237.4
Jarrod Dyson 999.0 -0.2 12.97 495 364 635
Taylor Ward 600.9 0.4 12.74 322 365 235.9
Guillermo Heredia 999.0 -0.1 12.52 491 366 633
Shed Long Jr. 999.0 0.2 12.44 367 367 632
Willi Castro 563.7 0.3 12.44 352 368 195.7
Juan Lagares 999.0 0.1 12.31 417 369 630
Charlie Culberson 597.2 0.0 11.99 432 370 227.2
Harold Ramirez 574.0 0.2 11.97 373 371 203
Jon Berti 576.3 0.4 11.84 328 372 204.3
Wilson Ramos 600.9 0.6 11.62 296 373 227.9
Luis Rengifo 600.2 0.4 11.33 317 374 226.2
Shogo Akiyama 600.8 0.0 11.26 449 375 225.8
Hoy Park 600.7 0.4 11.06 315 376 224.7
Jack Mayfield 600.9 0.1 10.56 407 377 223.9
Maikel Franco 590.0 0.0 10.51 444 378 212
David Bote 597.7 0.4 10.50 319 379 218.7
Mauricio Dubón 596.2 0.3 10.46 335 380 216.2
Seby Zavala 598.5 -0.1 10.25 485 381 217.5
Hanser Alberto 600.3 0.2 10.00 356 382 218.3
Michael Perez 999.0 0.4 9.88 318 383 616
Starlin Castro 597.3 0.4 9.81 332 384 213.3
Owen Miller 600.7 0.2 9.73 365 385 215.7
Kurt Suzuki 599.5 0.2 9.73 369 386 213.5
Jose Trevino 596.6 0.3 9.67 343 387 209.6
Ronald Torreyes 600.9 0.0 9.65 453 388 212.9
Eli White 599.7 0.1 9.45 414 389 210.7
Zach McKinstry 596.2 0.1 9.37 382 390 206.2
Andrew Velazquez 600.7 0.2 8.93 368 391 209.7
Orlando Arcia 600.8 0.3 8.90 350 392 208.8
Zack Collins 598.5 0.6 8.90 302 393 205.5
Joe Panik 999.0 0.1 8.69 422 394 605
Wilmer Difo 999.0 -0.1 8.50 487 395 604
Miguel Andújar 595.5 0.0 8.46 439 396 199.5
Travis Shaw 600.5 0.1 8.39 418 397 203.5
Jake Cave 999.0 -0.1 8.36 490 398 601
Josh Reddick 999.0 -0.2 8.21 498 399 600
Ryan O'Hearn 599.8 0.0 7.99 461 400 199.8
Tim Locastro 598.6 0.1 7.86 401 401 197.6
Kyle Garlick 999.0 0.0 7.71 471 402 597
Jake Rogers 600.8 0.3 7.66 347 403 197.8
Marwin Gonzalez 600.4 0.2 7.62 376 404 196.4
Billy McKinney 599.2 0.0 7.54 438 405 194.2
Renato Núñez 599.6 0.2 7.53 380 406 193.6
Cal Raleigh 560.8 0.9 7.34 249 407 153.8
Eric Sogard 999.0 0.0 7.33 430 408 591
Jose Rojas 600.4 0.0 7.32 450 409 191.4
Emmanuel Rivera 999.0 0.1 7.25 393 410 589
Yonny Hernandez 599.8 0.3 7.20 339 411 188.8
Stephen Vogt 999.0 0.1 7.09 409 412 587
Matt Carpenter 597.1 0.0 7.07 458 413 184.1
Danny Mendick 999.0 0.1 7.04 383 414 585
Sergio Alcántara 600.5 0.2 6.86 371 415 185.5
Brock Holt 999.0 0.1 6.75 423 416 583
Billy Hamilton 597.8 -0.1 6.73 489 417 180.8
Daniel Johnson 999.0 0.0 6.63 442 418 581
Rob Refsnyder 999.0 0.0 6.49 447 419 580
Ernie Clement 999.0 0.2 6.31 370 420 579
Lewis Brinson 600.5 0.0 6.25 441 421 179.5
Andrew Knizner 598.8 0.6 6.22 295 422 176.8
Chris Owings 600.7 0.0 6.20 435 423 177.7
Tres Barrera 999.0 0.1 6.13 398 424 575
Isaac Paredes 600.3 0.4 6.10 321 425 175.3
Luis Guillorme 999.0 0.3 5.94 353 426 573
Nick Maton 999.0 0.0 5.87 434 427 572
Phil Gosselin 594.3 0.0 5.87 460 428 166.3
Rodolfo Castro 999.0 0.1 5.81 386 429 570
Rafael Marchan 999.0 0.1 5.62 392 430 569
Cam Gallagher 600.6 0.3 5.50 338 431 169.6
Travis Jankowski 999.0 0.0 5.49 427 432 567
Sandy León 999.0 0.0 5.21 443 433 566
Pablo Reyes 999.0 0.1 5.21 408 434 565
Jordy Mercer 999.0 -0.1 5.19 483 435 564
José Peraza 600.8 0.1 5.19 410 436 164.8
William Contreras 593.2 0.2 5.16 363 437 156.2
Taylor Trammell 600.9 0.1 5.11 396 438 162.9
Alex Jackson 999.0 0.2 4.90 366 439 560
Reese McGuire 599.1 0.2 4.84 372 440 159.1
Taylor Jones 999.0 0.0 4.76 436 441 558
Jahmai Jones 600.7 0.1 4.71 411 442 158.7
Roman Quinn 999.0 0.0 4.40 454 443 556
Dustin Garneau 999.0 0.1 4.26 402 444 555
Jake Bauers 600.5 -0.1 4.22 486 445 155.5
Aramis Garcia 999.0 0.0 4.19 457 446 553
José Rondón 999.0 0.0 4.13 446 447 552
Ryan McKenna 999.0 0.1 4.00 391 448 551
Andrew Knapp 600.9 0.0 3.96 462 449 151.9
DJ Peters 600.5 0.0 3.93 464 450 150.5
Matt Joyce 999.0 -0.1 3.86 484 451 548
Drew Ellis 999.0 0.1 3.79 389 452 547
Luke Williams 999.0 0.0 3.77 455 453 546
Alex Blandino 999.0 -0.2 3.74 496 454 545
Magneuris Sierra 999.0 -0.1 3.69 493 455 544
Yermín Mercedes 596.9 0.1 3.63 415 456 140.9
Jake Lamb 999.0 0.1 3.62 404 457 542
Alan Trejo 999.0 0.0 3.57 426 458 541
Isan Díaz 600.5 0.1 3.56 421 459 141.5
Evan White 600.2 0.0 3.53 437 460 140.2
Richie Martin 999.0 -0.1 3.49 479 461 538
Zack Short 999.0 0.1 3.46 385 462 537
Skye Bolt 999.0 0.2 3.32 377 463 536
Ender Inciarte 600.0 0.0 3.24 465 464 136
David Dahl 599.6 0.0 3.15 476 465 134.6
Donovan Walton 999.0 0.1 3.06 405 466 533
Jason Vosler 999.0 0.1 3.04 419 467 532
Luke Raley 999.0 0.1 3.02 420 468 531
John Nogowski 600.8 0.0 2.95 433 469 131.8
Daniel Robertson 999.0 0.0 2.91 440 470 529
Gilberto Celestino 999.0 0.1 2.82 413 471 528
Grayson Greiner 999.0 0.0 2.76 456 472 527
Delino DeShields 999.0 0.0 2.60 451 473 526
Austin Romine 999.0 0.1 2.59 395 474 525
Tucupita Marcano 600.6 0.0 2.33 428 475 125.6
Abraham Almonte 999.0 0.0 2.25 459 476 523
Rio Ruiz 999.0 0.0 2.10 452 477 522
Pat Valaika 999.0 0.0 2.05 469 478 521
Jonathan Araúz 999.0 0.0 1.85 424 479 520
Webster Rivas 999.0 0.0 1.84 470 480 519
Wyatt Mathisen 999.0 0.0 1.81 431 481 518
Ildemaro Vargas 999.0 0.0 1.80 445 482 517
Patrick Mazeika 999.0 0.1 1.78 412 483 516
Phillip Evans 999.0 0.0 1.73 478 484 515
Chad Wallach 999.0 0.0 1.49 475 485 514
Jonathan Davis 999.0 -0.1 1.49 481 486 513
Franchy Cordero 999.0 0.0 1.47 463 487 512
Justin Williams 999.0 -0.1 1.42 488 488 511
Willians Astudillo 601.0 0.0 1.37 466 489 112
Edwin Ríos 587.4 0.0 1.23 472 490 97.4
Chance Sisco 999.0 0.1 0.97 397 491 508
Albert Almora Jr. 999.0 0.1 0.81 394 492 507
Robel García 999.0 0.0 0.51 473 493 506
NA
NA
NA
NA
NA
NA
---
title: "Welcome to my 2022 Projections for Hitters 5x5"
author: "Darshan Patel"
date: "`r Sys.Date()`"
output: 
  html_notebook:
    toc: true
    toc_float: true
    number_sections: true
    theme: sandstone
    highlight: tango
    fig_caption: true
    df_print: paged
---

<html>

<p>

Projections using Hypertuned model through XGboost

</p>

<p>

All data is from [FanGraphs.](https://www.fangraphs.com/) I have no affiliation with FanGraphs, but please consider contributing to their [website](https://plus.fangraphs.com/shop/) if you found this project informative.

</p>

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_knit$set(root.dir = 'C:/Users/Admin/Documents/Learning Python Folder1/Python Essence Training/Fantasy-Baseball/Data')
options(knitr.table.format = "html") 
options(digits=2)
options(scipen = 100)
```


# Project Scope {.tabset .tabset-pills}

## Objective

This project is designed to showcase how Using a Percentile Based Worth System values Fantasy Baseball Players through a Inning Pitched (IP) weighted projection

The Categories used for prediction valuation are year-end rankings for the following metrics:
-   HRs
-   Runs
-   RBIs
-   Batting Average
-   Stolen Bases

![](IntroChart6x6.png)


***  

# Processing the Data {.tabset .tabset-pills}

## Getting Data Into R

### Load Libraries

<p style="color:black;">

*First we need to load the packages that R needs to run the analysis*

</p>

```{r load library,message = FALSE,warning=FALSE}
library(sqldf) #SQL in R
library(skimr) #Summaries and useful for removing low % data
library(ggplot2) #Plotting Functions
library(plyr) #slightly deprecated data cleaning
library(dplyr) #slightly updated data cleaning
library(tidyverse) #tidyverse data cleaning universe
library(caret) #wrapper for creating, tuning and validating models
library(xgboost) #package for creating regression tree model
library(vtreat) # useful package for treating data before modeling 
library(Matrix)
library(Boruta)
library(mgcv)
library(moments) #for measuring skewness
library(data.table) #alternative to dplyr we use to create lags
library(pdp) #partial dependence graphs
library(vip) #variable importance 
library(grid) #put multiple plots on one grid
library(gridExtra) #additional grid functionality
library(janitor) #one function used to clean transposed data set
library(ggpubr) #for qq plot
library(tableHTML)
library(kableExtra)
```

The \# comments generally explain what additional functionality each library adds to R

### Load in Data

All data is downloaded from Fan Graphs. From this [location](https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=y&type=8&season=2021&month=0&season1=2021&ind=0). The data is also available on my Github [here](https://github.com/dissipation/Fantasy-Baseball). There are player level and team data sets

```{r data read-in, results= 'hide',message=FALSE}

#data read-in
Batter_data <- read_csv("FanGraphs Leaderboard_Hitting50PA.csv")

#Team datasets
FDG_Team = read_csv("FanGraphs Leaderboard_Team.csv")


#Create a prefix for all team stats that starts with T_
FDG_Team2 <- FDG_Team %>% 
  rename_with( ~ paste0("T_", .x))

```

### Checking Team Data

`str` give information about an object, while `skim` provides a customizable summary  

```{r checking team data}

#Output not shown for space
#str(FDG_Team2)

skim(FDG_Team2) %>%  
  tibble::as_tibble()
```
***  

## Understanding the Dataset

### Exploring the dataset

`skim` let's us see how the data was imported into R. Documentation can be found [here](https://cran.r-project.org/web/packages/skimr/vignettes/skimr.html)

```{r}

#Full Dataset dimensions

skimr::skim(Batter_data) %>% 
  tibble::as_tibble() %>% 
  select(skim_type,skim_variable,complete_rate) %>% 
  filter(complete_rate >0.30) #288 Variables

#skim_type - character or numeric
#skim_variable - name of variable
#complete_rate - % of data that is not missing
#filter - only keep variables that have 30% of data populated
```

Additionally let's look at how variables vary by year to see if there are any discrepancies there  

```{r}

#It looks like one year, there were fewer games played, and there is a clear drop off in home runs
Batter_data_dist =
Batter_data %>% 
 group_by(Season) %>% 
  summarize (Games_played = max(G),
             Avg_HR= mean(HR)
             )
Batter_data_dist

ggplot(Batter_data_dist, aes(Season, Avg_HR)) +
  geom_col()+
  ggtitle("Average Home Runs by Year")+
  theme(plot.title = element_text(hjust = 0.5,size = 22,color ="steel blue"))
```

***  

## Cleaning and Creating Initial Dataset for Model

What are some issues with the data?

1.  Many of Variables, such as K%, are being read in as characters

    -   Only Team and Player Name should be characters

2.  There is spotty data coverage in some of the variables (\~Variables have less than 30% Coverage)

3.  2020 Data only includes 60 games worth of data

    -   This was a season shortened due to Covid-19

4.  Team Data needs to be appended to Batter Data by Team Name 

***  

### Cleanly Changing all Variables that are characters to numeric.  
There are several ways to do this, we will identify the variables we want to change that are mis-identified. `parse_number` can be used to pull numbers from these variables. Additional ways to tackle this can be found [here](https://stackoverflow.com/questions/8329059/how-to-convert-character-of-percentage-into-numeric-in-r)

```{r}

#Select Column names that are characters but not Team or Name, These should be percentages
Batter_data_chars_to_convert <- Batter_data %>% 
  select_if(is.character)%>% select(-Team,-Name) %>% 
  mutate_all (function(x) as.numeric(readr::parse_number(x))/100)
#Note : There are additional ways to do this, this is just one solution


#We can exclude the variables we converted and reintroduce them
Batter_data_num <- Batter_data %>% select(-colnames(Batter_data_chars_to_convert))

Batter_data2 = cbind(Batter_data_num,Batter_data_chars_to_convert) %>% 
  select (colnames(Batter_data)) %>%  #preserve original order 
  dplyr::rename(flyball_perc = `FB%...46`,fastball_perc = `FB%...73`) #rename two ambiguous columns
  
skim(Batter_data2) %>% 
  as_tibble() %>% 
  group_by(skim_type) %>% 
  count()


#Logical variables are R's best guess, in our case they are all NA's and will be removed
```

The same can be done for the Team Data that is loaded  


```{r}

#Select Column names that are characters but not Team or Name, These should be percentages
FDG_Team2_chars_to_convert <- FDG_Team2 %>% 
  select_if(is.character)%>% select(-T_Team) %>% 
  mutate_all (function(x) as.numeric(readr::parse_number(x))/100)
#Keep in mind, parse number may make actual characters into numerical variables so carefully check your data before using

#We can exclude the variables we converted and reintroduce them
FDG_Team2_num <- FDG_Team2 %>% select(-colnames(FDG_Team2_chars_to_convert))

FDG_Team3 = cbind(FDG_Team2_num,FDG_Team2_chars_to_convert) %>% 
  select (colnames(FDG_Team2)) %>%  #preserve original order
dplyr::rename(T_flyball_perc = `T_FB%...45`,T_fastball_perc = `T_FB%...72`) 

skim(FDG_Team3) %>% 
  as_tibble() %>% 
  group_by(skim_type) %>% 
  count()
```
***  


### Filtering Data with Low Coverage    
I choose 30% coverage of data necessary but this can be adjusted up or down. This will also get rid of columns that are all `NA`.  
```{r}

# Keep variables with enough values (Need 30% data coverage rate here)
Player_cols_to_keep =
skim(Batter_data2) %>% 
  dplyr::select(skim_type, skim_variable, complete_rate) %>% 
  filter (complete_rate > 0.30)

#Transpose Rows to get column names as skim melts the data
Player_cols_to_keep_transpose = t(Player_cols_to_keep) 

#extract the colnames we would like to keep
Player_cols_to_keep = colnames(janitor::row_to_names(Player_cols_to_keep_transpose,row_number = 2))

#Only keep the columns designated to have over 30% of their data populated or greater
Batter_data3 = Batter_data2 %>% 
  select(one_of(Player_cols_to_keep)) 
```


*Repeat the process for Team Variables*
```{r}
Team_cols_to_keep =
skim(FDG_Team3) %>% 
  dplyr::select(skim_type, skim_variable, complete_rate) %>% 
  filter (complete_rate > 0.30)


#Transpose Rows to get column names as skim melts the data
Team_cols_to_keep_transpose = t(Team_cols_to_keep) 

#extract the colnames we would like to keep
Team_cols_to_keep = colnames(janitor::row_to_names(Team_cols_to_keep_transpose,row_number = 2))

#Only keep the columns designated to have over 30% of their data populated or greater
FDG_Team4 = FDG_Team3 %>% 
  select(one_of(Team_cols_to_keep)) 



```



***  

### Creating Variables Normalized by Year  
Some Variables will need to be normalized by Innings_Pitched (IP) if they aren't a percentage already. Remaining Variables are percentages or indices so will not need to be transformed. The full data dictionary for these variables can be found on FanGraph's website [here.](https://library.fangraphs.com/pitching/complete-list-pitching/) for pitching variables and [here.](https://library.fangraphs.com/offense/offensive-statistics-list/) for hitting variables.  
```{r}

Batter_data4 = Batter_data3 %>% 
  mutate( #create new variables based on existing variables
    H_PA = H/PA,
    x1B_PA = `1B`/PA, #note: R can't have variables start with a number
    x2b_PA = `2B`/PA,
    x3b_PA = `3B`/PA,
    HR_PA = HR/PA,
    R_PA = R/PA,
    RBI_PA = RBI/PA,
    BB_PA = BB/PA,
    IBB_PA = IBB/PA,
    SO_PA=SO/PA,
    HBP_PA=HBP/PA,
    SF_PA=SF/PA,
    SH_PA=SH/PA,
    GDP_PA= GDP/PA,#ground into double play
    SB_PA=SB/PA,
    CS_PA=CS/PA,
    GB_PA = GB/PA,   #Groundballs
    FB_PA =  FB/PA,  #FlyBalls
    LD_PA = LD/PA,   #LineDrives
    IFFB_PA = IFFB/PA,  #Infield Fly balls
    Pitches_PA= Pitches/PA,
    Balls_PA= Balls/PA,
    Strikes_PA= Strikes/PA,
    IFH_PA= IFH/PA,
    BU_PA= BU/PA,
    BUH_PA= BUH/PA,
    PH_PA= PH/PA,
    Barrels_PA= Barrels/PA,
    HardHits_PA= HardHit/PA
  ) %>% select(-(H:CS),-(GB:BUH),-PH,-Barrels,-HardHit,-Events) #Drop the old variables

#skim(Batter_data4) %>% as_tibble()


```

*Repeat the process for Team Variables*
```{r}

FDG_Team5 = FDG_Team4 %>% 
  mutate( #create new variables based on existing variables
    T_H_T_PA = T_H/T_PA,
    T_x1B_T_PA = T_1B/T_PA, #note: R can't have variables start with a number
    T_x2b_T_PA = T_2B/T_PA,
    T_x3b_T_PA = T_3B/T_PA,
    T_HR_T_PA = T_HR/T_PA,
    T_R_T_PA = T_R/T_PA,
    T_RBI_T_PA = T_RBI/T_PA,
    T_BB_T_PA = T_BB/T_PA,
    T_IBB_T_PA = T_IBB/T_PA,
    T_SO_T_PA=T_SO/T_PA,
    T_HBP_T_PA=T_HBP/T_PA,
    T_SF_T_PA=T_SF/T_PA,
    T_SH_T_PA=T_SH/T_PA,
    T_GDP_T_PA= T_GDP/T_PA,#ground into double play
    T_SB_T_PA=T_SB/T_PA,
    T_CS_T_PA=T_CS/T_PA,
    T_GB_T_PA = T_GB/T_PA,   #Groundballs
    T_FB_T_PA =  T_FB/T_PA,  #FlyBalls
    T_LD_T_PA = T_LD/T_PA,   #LineDrives
    T_IFFB_T_PA = T_IFFB/T_PA,  #Infield Fly balls
    T_Pitches_T_PA= T_Pitches/T_PA,
    T_Balls_T_PA= T_Balls/T_PA,
    T_Strikes_T_PA= T_Strikes/T_PA,
    T_IFH_T_PA= T_IFH/T_PA,
    T_BU_T_PA= T_BU/T_PA,
    T_BUH_T_PA= T_BUH/T_PA,
    T_PH_T_PA= T_PH/T_PA,
    T_Barrels_T_PA= T_Barrels/T_PA,
    T_HardHits_T_PA= T_HardHit/T_PA
  ) %>% select(-(T_H:T_CS),-(T_GB:T_BUH),-T_PH,-T_Barrels,-T_HardHit,-T_Events) #Drop the old variables


#skim(FDG_Team5) %>% as_tibble()


```

***  

### Creating Lagged Variables  

There are several ways to lag a dataset **BY GROUP**.\
\* `Dplyr` way is [here.](https://statisticsglobe.com/create-lagged-variable-by-group-in-r).\
\* The `data.table` (the method used below) is [here.](https://stackoverflow.com/questions/26291988/how-to-create-a-lag-variable-within-each-group)  
```{r}
#Note we will only be lagging the player level data, as the previous year's team performance shouldn't impact current performance


#Order the dataset by lag columns
Batter_data5 =  arrange(Batter_data4, playerid,Season) #playerid is the Fangraph id assigned to each player

# Convert dataframe to data.table format
DT_batter = data.table(Batter_data5)

#designate columns to lag - which is all of them
cols1 = colnames(Batter_data5)
anscols = paste("lag", cols1, sep="_")
DT_batter[, (anscols) := data.table::shift(.SD, 1, NA, "lag"),by ='playerid', .SDcols=cols1] #Create 1 period lags by year

Batter_data6 = as.data.frame(DT_batter) %>% select(-lag_playerid, -lag_Team, -lag_Season, -lag_Age,-lag_Name)

ncol(Batter_data5) #287 - no lags
ncol(Batter_data6) #574 - lagged data ~ (287 * 2)-5

```

***    

### Merging Team and Player Data  
We can use either the `merge` function or the SQL functionality provided by the `sqldf` package to join the lagged player level data to the Team level data

```{r}

df_batting_init = sqldf(
  "
  select a.*, b.*
  from Batter_data6 a
  left join FDG_Team5 b
  on a.Team = b.T_Team and a.Season = b.T_Season
  
  "
)  %>% select(-T_Team,-T_Season,T_Age,T_G,T_AB)# Unncessary Team Variables


nrow(df_batting_init) - nrow(Batter_data6) #check if any rows are duplicated


```


***  


# Creating Rankings for Players Based On Percentiles {.tabset .tabset-pills}

We can use Percentile based ranking to get rankings for players from the 2021 season.

## Worth of each stat

### Calculating past performance

Each player goes from a 0% to 100% on each percentile stat that is used for creating a scoring opportunity. Data is not yet normalized by PA as certain stats such as HRs and SBs will be worth more when we do.\

```{r}

#Categories I include are:
#Runs (R), Home Runs (HR), Runs Batted In (RBI), Stolen Bases (SB), Batting Average (AVG)
df_batting_init2 =  df_batting_init %>%
#  arrange(player_id,year) %>% 
  group_by(Season) %>% 
  mutate(
    Runs_share = order(order(rank(R_PA,ties.method = 'average'),decreasing = FALSE))/n(),
     HR_share = order(order(rank(HR_PA,ties.method = 'average'),decreasing = FALSE))/n(),
     RBI_share = order(order(rank(RBI_PA,ties.method = 'average'),decreasing = FALSE))/n(),
     SB_share = order(order(rank(SB_PA,ties.method = 'average'),decreasing = FALSE))/n(),
     AVG_share = order(order(rank(AVG,ties.method = 'average'),decreasing = FALSE))/n(),
    OPS_share = 0,
    Worth = Runs_share+HR_share+RBI_share+SB_share+AVG_share+OPS_share
    ) %>% 
  ungroup() 

```

Chart of the Distribution of initial percentiles  
As the chart below shows, the data is roughly normal.
```{r}

skewness((df_batting_init2$Worth))

ggplot2::qplot(df_batting_init2$Worth, main="Total Dataset") + geom_histogram(colour="black", fill="lime green")

min(df_batting_init2$Worth)

max(df_batting_init2$Worth)

ggpubr::ggqqplot(df_batting_init2$Worth)

shapiro.test(df_batting_init2$Worth)
```


***  
## 2021 Player Rankings - Per PA performance

### 2021 Player Rankings - Top Worth Player

There are per PA rankings. Players like Byron Buxton which had a great per PA score but can't stay healthy for a season will be adjusted down.

```{r,warning=FALSE}

options(digits=2)

df_batting_init2021 =
df_batting_init2 %>% 
  group_by(Name) %>% 
  filter(Season == 2021) %>% 
  arrange(desc(Worth)) %>% 
  select(Name,Runs_share,HR_share,RBI_share, SB_share,OPS_share,AVG_share,Worth)


df_batting_init2021 %>%
  filter (Worth>3.9) %>% 
  kbl() %>% 
 kable_material(c("striped", "hover","condensed","responsive"),full_width = F,fixed_thead = T)
```


***  


# Creating Model File {.tabset .tabset-pills}  

## Additional Data Prep  

### Remove Variables which are based off current hitting numbers  

Not all variables can be used for predictive modeling.  Variables that go into the percentile ranking or are non-normalized metrics created after the fact (such as `WAR` - Wins above Replacement or `RE24`) should be removed. However, metrics that are normalized by a per pitch basis (such as `HR/FB%+`) can remain as we expect batters to have similar performance in these metrics one year out.  

```{r}
#Creating a new dataset to keep original intact
df_batting_init3 = df_batting_init2
```

Lagged Percentile (`_share`) Variables can be used for predictive modeling. However since these variables were created for the Worth metric they must also be removed for modeling purposes.  

```{r}

#Order the dataset by lag columns
df_batting_init4 =  arrange(df_batting_init3, playerid,Season) #playerid is the Fangraph id assigned to each player

# Convert dataframe to data.table format
DT_batter2 = data.table(df_batting_init4)

#designate columns to lag - which is all of them
cols1 = (c('Runs_share','HR_share','RBI_share', 'SB_share','OPS_share','AVG_share','Worth'))
anscols = paste("lag", cols1, sep="_")
DT_batter2[, (anscols) := data.table::shift(.SD, 1, NA, "lag"),by ='playerid', .SDcols=cols1] #Create 1 period lags by year

#names(DT_batter2)

df_batting_final = as.data.frame(DT_batter2) %>% 
  select(-c(Runs_share,HR_share,RBI_share, SB_share,OPS_share,AVG_share))%>% 
  select(-(G:AVG),-(OBP:BABIP),-(wOBA:Dol),-(`wRC+`:REW),-(`WPA/LI`),-(wFB:wSF),-BsR,-(Def:wGDP),-(`wCH (pi)`:`wCH/C (pi)`),-(`AVG+`),-(`OBP+`:`BABIP+`),-(H_PA:PH_PA)) %>% select (-Name)


```



### Creating Training/Test Split  
We split the data into Training Data (which is used to create the model) and test data (which is used to validate the model)   
```{r}

set.seed(15674)  # For reproducibility
# Create index for testing and training data
inTrain <- createDataPartition(y = df_batting_final$Worth, p = 0.80, list = FALSE)
# subset pitching data for training
tr_2021 <- df_batting_final[inTrain,]
# subset the rest to test and validate trained model
te_2021 <- df_batting_final[-inTrain,]

nrow(tr_2021)/nrow(df_batting_final) #check if split is 0.8

```

### Treat Missing Data by Imputing Mean Value  
Vtreat Package in R is excellent for treating data before using for modeling. Additional documentation can be found [here.](https://winvector.github.io/vtreat/index.html)  
*Note: The treatment plan also fixes variables names  like`HR/FB%+` (which R doesn't always handle the best) to `HR_slash_FB_percent_plus_`*  
```{r}
treat_plan_2021 <- vtreat::designTreatmentsZ(
  dframe = tr_2021, # training data
  varlist = colnames(tr_2021) %>% .[. != "hitting_score1"], # input variables = all training data columns, except random
  codeRestriction = c("clean", "isBAD", "lev"), # derived variables types (drop cat_P)
  verbose = FALSE) # suppress messages

#clean stands for cleaned numerical variable, isBAD indicates that a value replacement has occurred (which indicates a missing value in this case), and lev is a binary indicator whether a particular value of that categorical variable was present.  

#### Checking Scoreframe

score_frame <- treat_plan_2021$scoreFrame %>% 
  select(varName, origName, code)

head(score_frame)


tr_treated_2021 <- vtreat::prepare(treat_plan_2021, tr_2021)
te_treated_2021 <- vtreat::prepare(treat_plan_2021, te_2021)


Total_dataset1_untreat = as.data.frame(DT_batter2) %>% select(-Name)

treat_plan_2021 <- vtreat::designTreatmentsZ(
  dframe = Total_dataset1_untreat, # training data
  varlist = colnames(Total_dataset1_untreat) %>% .[. != "hitting_score1"], # input variables = all training data columns, except random
  codeRestriction = c("clean", "isBAD", "lev"), # derived variables types (drop cat_P)
  verbose = FALSE) # suppress messages


total_treated_2021_hitting <- vtreat::prepare(treat_plan_2021, Total_dataset1_untreat)



#tr_treated = tr
#te_treated = te

dim(tr_treated_2021) #note there are dummies for each player and team

```


***    


### Check Distribution of Training Population  
The population used for Training should be indicative of Total Population
```{r}

ggplot2::qplot(tr_treated_2021$Worth, main="Training Set") + geom_histogram(colour="black", fill="limegreen") + theme_bw()

skewness(tr_treated_2021$Worth) #The skewness is the same as the overall
```


# Running XGboost Model {.tabset .tabset-pills} 
To keep things simple with modeling, we’ll turn the training data into simple input variables for `caret::train`, dropping the response variable and converting the data frame to a matrix. Documentation for this approach to XGboost can be found [here.](https://www.kaggle.com/pelkoja/visual-xgboost-tuning-with-caret)    

## Tuning the Model

### Initial Non-Tuned Model
Break the data set into x and y inputs with x being a matrix  
```{r}
input_x <- as.matrix(((tr_treated_2021))%>%
   select(-Worth) %>%                      
   select(!ends_with ("_isBAD")))

input_y <- tr_treated_2021$Worth

```

XGBoost with Default Hyperparameters    
The Variable Importance (`caret::varImp(xgb_base_2021, scale = F  )`) from the caret package shows the contribution of each variable to the initial model. As you can see SLG_plus_ (SLG+) takes up much of the importance as it is derived from SLG (one of the key contributors to Worth). These types of variables will be removed during variable selection in the next step.  
*XGBoost documentation can be found for more general models [here.](https://www.kaggle.com/code/rtatman/machine-learning-with-xgboost-in-r/notebook)*

```{r}

#Defaults for xgboost model
grid_default <- expand.grid(
  nrounds = 100,
  max_depth = 6,
  eta = 0.3,
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample = 1
)

#This is a blank train_control set, this will be updated after
train_control <- caret::trainControl(
  method = "none",
  verboseIter = FALSE, # no training log
  allowParallel = TRUE # FALSE for reproducible results 
)

xgb_base_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = train_control,
  tuneGrid = grid_default,
  method = "xgbTree",
  verbose = TRUE
)

caret::varImp(xgb_base_2021, scale = F  )




```

## Further Variable Selection

### Remove redundant and highly correlated variables

Selection Removal Step 1: Check for high correlations\
Normally, this step is done early, but those steps were reserved for preparing the data

```{r}

dep_cor1 <- t(as.data.frame(cor(tr_treated_2021[ , colnames(tr_treated_2021) != "Worth"],
                tr_treated_2021$Worth)))
dep_cor1 <-
as.data.frame(t(as.data.frame(dep_cor1)%>% 
  select(!starts_with("lag")) %>% #remove lag variables
  select(!contains("_isBAD")))) 

dep_cor1 <- tibble::rownames_to_column(dep_cor1,"VARIABLES")%>% #remove indicators for missing data
  filter(V1 > 0.70|V1 < -0.5)

dep_cor1

dep_cor2 <- colnames(row_to_names(t(dep_cor1),row_number = 1))



```
Let's Remove variables with high correlation to worth metric (such as `wFB/C`)

```{r}

input_x <- as.matrix(((tr_treated_2021))%>%
   select(-Worth) %>% #Remove dependent variable
     select (-all_of(dep_cor2) ) %>%      
select(!ends_with ("_isBAD"))) #indicator variable for missing data

input_y <- tr_treated_2021$Worth





```

Run the model on the new dataset to make sure the variable importances look fine
```{r}

#Note Training parameters were set in initial model set up
xgb_base_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = train_control,
  tuneGrid = grid_default,
  method = "xgbTree",
  verbose = TRUE
)

caret::varImp(xgb_base_2021, scale = F  )


```


## Model with new data  

### Tuning All Hyperparameters
A tune grid allows us to test a large amount of hyper-parameters and find the model with the lowest RMSE for predictions.   
However, The more values you want to test and the greater the amount of Cross-Fold Validations (`method = "cv"`), the greater the computational time it will take. More information on the specific parameters can be found [here.](https://www.hackerearth.com/practice/machine-learning/machine-learning-algorithms/beginners-tutorial-on-xgboost-parameter-tuning-r/tutorial/)

```{r}

# maximum number of trees
nrounds <- 1000

# note to start nrounds from 200, as smaller learning rates result in errors so
# big with lower starting points that they'll mess the scales
tune_grid <- expand.grid(
  nrounds = seq(from = 100, to = nrounds, by = 50),
  eta = c(0.01, 0.025, 0.05, 0.1),
  max_depth = c(2, 4, 6, 8),
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample = 1
)

tune_control <- caret::trainControl(
  method = "cv", # cross-validation
  number = 5, # with n folds 
  ## Note this was # out in the original code
  #index = createFolds(tr_treated$Id_clean), # fix the folds
  verboseIter = FALSE, # no training log
  allowParallel = TRUE # FALSE for reproducible results 
)



```

*Running the initial tuning model*  
```{r}
#Note I will be timing these runs to give an estimate on how long this model takes to run
start_time <- Sys.time()

xgb_tune_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = tune_control,
  tuneGrid = tune_grid,
  method = "xgbTree",
  verbose = FALSE
  ,verbosity = 0
)

end_time <- Sys.time()

end_time - start_time

```

*Tuning Plot and Variable Importance*
```{r}
varImp(xgb_tune_2021, scale = F  ) 


# helper function for the plots
tuneplot <- function(x, probs = .90) {
  ggplot(x) +
    coord_cartesian(ylim = c(quantile(x$results$RMSE, probs = probs), min(x$results$RMSE))) +
    theme_bw()
}

tuneplot(xgb_tune_2021)
```

### Fine Tuning Model  
#### Second Tuning: Maximum Depth and Minimum Child Weight  
After fixing the learning rate to 0.1 and we’ll also set maximum depth to 3 +-1 (or +2 if max_depth == 2) to experiment a bit around the suggested best tune in previous step. Then, well fix maximum depth and minimum child weigh

```{r}
tune_grid2 <- expand.grid(
  nrounds = seq(from = 50, to = nrounds, by = 50),
  eta = xgb_tune_2021$bestTune$eta,
  max_depth = ifelse(xgb_tune_2021$bestTune$max_depth == 2,
    c(xgb_tune_2021$bestTune$max_depth:4),
    xgb_tune_2021$bestTune$max_depth - 1:xgb_tune_2021$bestTune$max_depth + 1),
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = c(1, 2, 3),
  subsample = 1
)

xgb_tune2_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = tune_control,
  tuneGrid = tune_grid2,
  method = "xgbTree",
  verbose = TRUE
)

tuneplot(xgb_tune2_2021)

xgb_tune2_2021$bestTune

varImp(xgb_tune2_2021, scale = F  ) 
```

#### Third Tuning: Column and Row Sampling

```{r}

tune_grid3 <- expand.grid(
  nrounds = seq(from = 50, to = nrounds, by = 50),
  eta = xgb_tune_2021$bestTune$eta,
  max_depth = xgb_tune2_2021$bestTune$max_depth,
  gamma = 0,
  colsample_bytree = c(0.4, 0.6, 0.8, 1.0),
  min_child_weight = xgb_tune2_2021$bestTune$min_child_weight,
  subsample = c(0.5, 0.75, 1.0)
)

xgb_tune3_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = tune_control,
  tuneGrid = tune_grid3,
  method = "xgbTree",
  verbose = TRUE
)

tuneplot(xgb_tune3_2021, probs = .95)

xgb_tune3_2021$bestTune

varImp(xgb_tune3_2021, scale = F  ) 
```

#### Fourth Tuning: Gamma  
Next, we again pick the best values from previous step, and now will see whether changing the gamma has any effect on the model fit:
```{r}
tune_grid4 <- expand.grid(
  nrounds = seq(from = 50, to = nrounds, by = 50),
  eta = xgb_tune_2021$bestTune$eta,
  max_depth = xgb_tune2_2021$bestTune$max_depth,
  gamma = c(0, 0.05,0.1, 0.2,0.4, 0.5, 0.7, 0.9, 1.0),
  colsample_bytree = xgb_tune3_2021$bestTune$colsample_bytree,
  min_child_weight = xgb_tune2_2021$bestTune$min_child_weight,
  subsample = xgb_tune3_2021$bestTune$subsample
)

xgb_tune4_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = tune_control,
  tuneGrid = tune_grid4,
  method = "xgbTree",
  verbose = TRUE
)

tuneplot(xgb_tune4_2021)

xgb_tune4_2021$bestTune

varImp(xgb_tune4_2021, scale = F  ) 
```

#### Fifth Tuning: Reducing the Learning Rate  
Now, we have tuned the hyperparameters and can start reducing the learning rate to get to the final model:  

```{r}
start_time <- Sys.time()

tune_grid5 <- expand.grid(
  nrounds = seq(from = 100, to = 10000, by = 75),
   eta = c(0.01, 0.015, 0.025,0.035, 0.05,0.75, 0.1),
  max_depth = xgb_tune2_2021$bestTune$max_depth,
  gamma = xgb_tune4_2021$bestTune$gamma,
  colsample_bytree = xgb_tune3_2021$bestTune$colsample_bytree,
  min_child_weight = xgb_tune2_2021$bestTune$min_child_weight,
  subsample = xgb_tune3_2021$bestTune$subsample
)



xgb_tune5_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = tune_control,
  tuneGrid = tune_grid5,
  method = "xgbTree",
  verbose = TRUE
)

#tuneplot(xgb_tune5_2021)

end_time <- Sys.time()

end_time - start_time

xgb_tune5_2021$bestTune

varImp(xgb_tune5_2021, scale = F  ) 
```


#### Fitting Final Model

```{r}

(final_grid_2021 <- expand.grid(
  nrounds = xgb_tune5_2021$bestTune$nrounds,
  eta = xgb_tune5_2021$bestTune$eta,
  max_depth = xgb_tune5_2021$bestTune$max_depth,
  gamma = xgb_tune5_2021$bestTune$gamma,
  colsample_bytree = xgb_tune5_2021$bestTune$colsample_bytree,
  min_child_weight = xgb_tune5_2021$bestTune$min_child_weight,
  subsample = xgb_tune5_2021$bestTune$subsample
))

(xgb_model_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = train_control,
  tuneGrid = final_grid_2021,
  method = "xgbTree",
  verbose = TRUE
))

varImp(xgb_model_2021, scale = F  ) 

```

### Model Performance  


#### Checking Model on Test Split Data  

```{r}


y_pred_test <- predict(xgb_model_2021, data.matrix(te_treated_2021))

test_stats= cbind((te_treated_2021$Worth),y_pred_test)

test_statsR2 = cor(test_stats[,1],test_stats[,2])^2

print(test_statsR2)


y_pred_train <- predict(xgb_model_2021, data.matrix(tr_treated_2021))

train_stats = cbind((tr_treated_2021$Worth),y_pred_train)

train_statsR2 = cor(train_stats[,1],train_stats[,2])^2

print(train_statsR2)

#test dataset
x <- select(te_treated_2021, -Worth)
y <- (te_treated_2021$Worth)

(xgb_model_rmse <- ModelMetrics::rmse(y, predict(xgb_model_2021, newdata = x)))

holdout_x <- select(tr_treated_2021, -Worth)
holdout_y <- tr_treated_2021$Worth

(xgb_model_rmse <- ModelMetrics::rmse(holdout_y, predict(xgb_model_2021, newdata = holdout_x)))


```

#### Graphical Representation of Model   


```{r}

ggplot2::ggplot() +
  aes(x = test_stats[,1], y = test_stats[,2]) +
  geom_jitter() +
  xlab("Predicted Values") +
  ylab("Actual Values") +
  ggtitle("Results of Hitting Model on Test Data")+
  theme(plot.title = element_text(hjust = 0.5,size = 22,color ="steel blue"))+
  geom_smooth(method = "lm")
```


# Creating 2022 Projections from Model  {.tabset .tabset-pills} 


## Re-fit model for Important Variables
Now that we have an acceptable model, we can use it to create projections for how well we think players should do in 2022 based on their hitting statistics in 2021. First let's reduce

1. Only keep variables with high enough importance in model  

```{r}


vip(xgb_model_2021, num_features = 30)  # 10 is the default, 30 gives a visual on the top 30 most important features of the model

unscalevi = vi(xgb_model_2021, method="model") #shows the numbers behind the plot

unscalevi$Importance_perc = with(unscalevi,Importance/sum(Importance)) #adds percentages 

unscalevi # importance by variables

variables_to_keep_2021 = subset(unscalevi, Importance_perc > 0.0010) %>% select(Variable) #Keep Variables that explain at least a small amount [0.1%] of the model. This is a low threshold for inclusion ,but you can adjust this

variables_to_keep_2021b = t(variables_to_keep_2021)

variables_to_keep_2022 = colnames(row_to_names(variables_to_keep_2021b,row_number = 1))

tr_treated_2022 = tr_treated_2021 %>%  select(Worth,one_of(variables_to_keep_2022),starts_with("Team_lev_x_")) #keep modeled important variables along with team indicator variables

te_treated_2022 = te_treated_2021 %>%  select(Worth,one_of(variables_to_keep_2022),starts_with("Team_lev_x_"))

input_x_2022 = as.matrix(select(tr_treated_2022, -Worth))

input_y_2022 = tr_treated_2022$Worth



```

2. Re-fit model with reduced variable scope  

```{r}


(final_grid_2021 <- expand.grid(
  nrounds = xgb_tune5_2021$bestTune$nrounds,
  eta = xgb_tune5_2021$bestTune$eta,
  max_depth = xgb_tune5_2021$bestTune$max_depth,
  gamma = xgb_tune5_2021$bestTune$gamma,
  colsample_bytree = xgb_tune5_2021$bestTune$colsample_bytree,
  min_child_weight = xgb_tune5_2021$bestTune$min_child_weight,
  subsample = xgb_tune5_2021$bestTune$subsample
))

(xgb_model_2022 <- caret::train(
  x = input_x_2022,
  y = input_y_2022,
  trControl = train_control,
  tuneGrid = final_grid_2021,
  method = "xgbTree",
  verbose = TRUE
))


vip(xgb_model_2022, num_features = 30)

unscalevi24 = vi(xgb_model_2022, method="model")

unscalevi24$Importance_perc = with(unscalevi24,Importance/sum(Importance)) 

unscalevi24

save(xgb_model_2022,file = '2022_Hitting5x5_Model.Rdata')

hitting5x5 = xgb_model_2022

hittinginput = input_x_2022

#For anything above breaking_IP we need to create projection table by age or age bucket

#write_csv(unscalevi24,"unscalevi24.csv")

```

------------------------------------------------------------------------
# 2022 Projections Full  

First let's prepare a file for predicting based on our model object

```{r}


variableslag5xb= row_to_names(as.data.frame(t(variables_to_keep_2022)),row_number = 1)  %>% select (starts_with("lag"))

variables_nolag5xb = (owmr::remove_prefix(variableslag5xb,"lag" , sep = "_"))

Data_Predict_2022a5xb = total_treated_2021_hitting %>% select (one_of(colnames(variables_nolag5xb)),Season,playerid)

colnames(Data_Predict_2022a5xb) <- paste0("lag_", colnames(Data_Predict_2022a5xb))

Data_Predict_2022b5xb = total_treated_2021_hitting %>% select (one_of(colnames(variables_nolag5xb)))
colnames(Data_Predict_2022b5xb) = colnames(variableslag5xb)

variables_to_keep_2022_nolag5xb = total_treated_2021_hitting %>% select(one_of(variables_to_keep_2022),Season,playerid,starts_with("Team_lev_x_"))%>% select(-one_of(colnames(Data_Predict_2022b5xb)))


Data_predict_20225xb = sqldf(
  "
  select a.*,b.* from
  Data_Predict_2022a5xb a,
  variables_to_keep_2022_nolag5xb b
  on b.playerid = a.lag_playerid
  and b.Season = a.lag_Season
  "
) %>% select(-lag_playerid,lag_Season) %>%
  filter(Season == 2021) %>% 
  select(one_of(variables_to_keep_2022),starts_with("Team_lev_x_"))



```

------------------------------------------------------------------------

## Create Predictions for Model

### Run Projections on Players who Played in 2021

This is the raw prediction score per IP for each pitcher

```{r}

hitting_predictions5xb = as.data.frame(predict(xgb_model_2022,Data_predict_20225xb))

names(hitting_predictions5xb) = c("Predict_Score")

Data_predict_2022_w_hitting_Predictions5xb = cbind(Data_predict_20225xb,hitting_predictions5xb) %>% select(playerid,Predict_Score)

head(Data_predict_2022_w_hitting_Predictions5xb)

```

------------------------------------------------------------------------


```{r}

Latest_2022_hittingdata_FP = read_csv("FanGraph_Fantasy_Baseball_Hitting.csv")

Latest_2022_hittingdata_FP


```




------------------------------------------------------------------------

```{r, warning = False}


hitting_Data_NonAdj_Projections5xb = sqldf(
  "
  select a.*,b.Predict_Score
  from Latest_2022_hittingdata_FP a 
  left join 
  Data_predict_2022_w_hitting_Predictions5xb b
  on a.playerid = b.playerid
  "
) %>% filter(ADP<370 | is.na(Predict_Score)==F)


hitting_Data_Adj_Projections5xb =
hitting_Data_NonAdj_Projections5xb %>% 
  mutate(
    Avg_PA = 300,
    AdjPredict_Score_raw = ifelse(is.na(Predict_Score),NA,Predict_Score*(PA/Avg_PA)),
    max_predscore= max(AdjPredict_Score_raw,na.rm = T),
    AdjPredict_Score = ifelse (is.na(AdjPredict_Score_raw),NA,AdjPredict_Score_raw *100/max_predscore),
    WAR_rank = order(order(rank(WAR,ties.method = 'average'),decreasing = TRUE)),
    AdjPredict_Score_Rank = order(order(rank(AdjPredict_Score,ties.method = 'average'),decreasing = TRUE))-sum(is.na(AdjPredict_Score)),
        Ranks_Above_ADP = ADP - AdjPredict_Score_Rank
  ) %>% select (Name,ADP,WAR, WAR_rank,AdjPredict_Score ,AdjPredict_Score_Rank,Ranks_Above_ADP)


  

ggplot2::qplot(hitting_Data_Adj_Projections5xb$AdjPredict_Score, main="Predictions") + geom_histogram(colour="black", fill="grey") + theme_bw()


```

------------------------------------------------------------------------

# 2022 Projections Full

## Table of hitting Projections (Players who Didn't Play in 2021 - Recieve an NA)

AdjPredict_Score are normalized to 100

```{r}

tableexport =
hitting_Data_Adj_Projections5xb %>%
  arrange (ADP,WAR) %>% 
  kbl() %>% 
 kable_material(c("striped", "hover","condensed","responsive"),full_width = F,fixed_thead = T)

save_kable(tableexport,file = "hitting5x5.html")

#tableexport



```

This is a better formatted Table

```{r , warning=FALSE}


ft_dt <- hitting_Data_Adj_Projections5xb[1:nrow(hitting_Data_Adj_Projections5xb), 1:ncol(hitting_Data_Adj_Projections5xb)] %>% 
  filter(AdjPredict_Score_Rank>0)%>%  arrange((AdjPredict_Score_Rank))

ft_dt$ADP <- color_tile("white", "red")(ft_dt$ADP)

ft_dt$WAR <- color_bar("lightblue")(ft_dt$WAR)

ft_dt$AdjPredict_Score<- color_bar("lightblue")(ft_dt$AdjPredict_Score)

ft_dt$WAR_Rank <- color_tile("green","orange")(ft_dt$WAR_rank)

ft_dt$Predict_Rank <- color_tile("green","orange")(ft_dt$AdjPredict_Score_Rank) 


ft_dt$Ranks_Above_ADP <- 
  ifelse(
  ft_dt$Ranks_Above_ADP < 0,
  cell_spec(round(ft_dt$Ranks_Above_ADP,2), color = "red", italic = T),
  cell_spec(round(ft_dt$Ranks_Above_ADP,2), color = "green", italic = T)
)


ft_dt2 <- ft_dt[c("Name", "ADP", "WAR", "AdjPredict_Score", "WAR_Rank","Predict_Rank","Ranks_Above_ADP")]



table_export = 
kbl(ft_dt2, escape = F) %>% 
 kable_material(c("striped", "hover","condensed","responsive"),full_width = F,fixed_thead = T) %>%   column_spec(6, width = "3cm") %>%
  add_header_above(c(" ", "Scores" = 3, "Ranks" = 2," "))

save_kable(table_export,file = "Hitting5x5_updated.html")
  
table_export  






```







</html>
